KR102306053B1

KR102306053B1 - Method and apparatus for training language skills for older people using speech recognition model

Info

Publication number: KR102306053B1
Application number: KR1020200032273A
Authority: KR
Inventors: 신대진
Original assignee: 주식회사 이드웨어
Priority date: 2020-03-16
Filing date: 2020-03-16
Publication date: 2021-09-29
Also published as: KR20210115879A

Abstract

A language training method and an apparatus therefor are provided. According to an embodiment of the present invention, a language training method for a user, performed by a speech recognition device, may include: measuring the user's semantic typicality values for a plurality of training data sets when no previous training record of the user exists; generating a target training data set based on the user's semantic typicality values when a previous training record of the user exists; updating the user's semantic typicality value for the target training data set based on the result of the user's language training with the target training data set; and updating the speech recognition model used by the speech recognition device using the semantic typicality values of a plurality of users for the plurality of training data sets.

Description

A language training method for the elderly using a speech recognition model and a device therefor

The present invention relates to a language training method for the elderly using a speech recognition model and an apparatus therefor. More specifically, it relates to a method and apparatus for providing language training content that can improve the language ability of the elderly on the basis of scientific evidence.

As the world population ages, the cognitive rehabilitation market is growing at an average annual rate of 32.3%, and the market is growing to about USD 8 billion in 2021.

In addition, as technology advances and content diversifies, a variety of brain training content is being developed, but interfaces and dedicated content for the elderly, who are the main users of brain training content, are not being developed.

Meanwhile, although speech recognition technology is being actively developed, content development for the elderly is insufficient, and there is no content for evaluating or improving the cognitive ability of the elderly. Accordingly, there is a need to develop technology that uses speech recognition to train the language skills that are fundamental to communication.

Embodiments of the present invention provide a content providing method and apparatus for evaluating and improving the language ability of the elderly.

According to an embodiment of the present invention, a language training method for a user, performed by a speech recognition device, may include: measuring the user's semantic typicality values for a plurality of training data sets when no previous training record of the user exists; generating a target training data set based on the user's semantic typicality values when a previous training record of the user exists; updating the user's semantic typicality value for the target training data set based on the result of the user's language training with the target training data set; and updating the speech recognition model used by the speech recognition device using the semantic typicality values of a plurality of users for the plurality of training data sets.

In an embodiment, the training data set includes one or more data pairs, each consisting of at least one reference word and a plurality of comparison words, and measuring the user's semantic typicality value may include measuring the user's semantic typicality value for the comparison word selected by the user, based on the user's selection of one of the plurality of comparison words included in a target data pair.

In an embodiment, the one or more data pairs included in the training data set include words belonging to a single category, and measuring the semantic typicality value may further include measuring a correlation coefficient between the plurality of data pairs included in the plurality of training data sets.

In an embodiment, generating the target training data set includes generating a first target training data set and a second target training data set, and the deviation between the user's semantic typicality value for the first target training data set and the user's semantic typicality value for the second target training data set may be a specified value.

In an embodiment, the average of the user's semantic typicality value for the first target training data set and the user's semantic typicality value for the second target training data set may be a specified value.

In an embodiment, updating the user's semantic typicality value includes updating the user's semantic typicality value based on the result of the user's language training with the first target training data set and then updating it based on the result of the user's language training with the second target training data set, wherein the user's semantic typicality value for the first target training data set is greater than the user's semantic typicality value for the second target training data set, and the semantic typicality value may be a value that increases as the typicality among the words included in the target training data set decreases.
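The constraints on the first and second target training data sets can be illustrated with a minimal Python sketch. The function name `pick_target_pair`, the dictionary layout, and the tolerance parameter are assumptions made for illustration; the source does not specify how the pair is searched for.

```python
from itertools import permutations


def pick_target_pair(typicality, deviation, mean=None, tol=1e-6):
    """Pick (first, second) training data set names for one user.

    typicality: dict mapping a data set name to the user's semantic
        typicality value for that set (hypothetical layout).
    deviation:  required difference between the two values.
    mean:       optional required average of the two values.
    Returns None when no pair satisfies the constraints.
    """
    for first, second in permutations(typicality, 2):
        a, b = typicality[first], typicality[second]
        if a <= b:
            # The first set must have the greater typicality value.
            continue
        if abs((a - b) - deviation) > tol:
            continue
        if mean is not None and abs((a + b) / 2 - mean) > tol:
            continue
        return first, second
    return None
```

For example, with values `{"fruit": 0.8, "tools": 0.4, "animals": 0.6}`, a deviation of 0.2 and a mean of 0.7 select `("fruit", "animals")`. An exhaustive search is used here only because it is easy to read; any search strategy that honors the same constraints would serve.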

In an embodiment, the speech recognition model includes a language-specific speech recognition decoder corresponding to each of a plurality of languages, and updating the speech recognition model may include updating the speech recognition decoder corresponding to the language used by the user.

A language training apparatus according to another embodiment of the present invention includes an input/output interface that receives a user's voice, and a processor, wherein the processor may measure the user's semantic typicality values for a plurality of training data sets when no previous training record of the user exists, generate a target training data set based on the user's semantic typicality values when a previous training record of the user exists, update the user's semantic typicality value for the target training data set based on the result of the user's language training with the target training data set, and update a speech recognition model using the semantic typicality values of a plurality of users for the plurality of training data sets.

FIG. 1 is a diagram illustrating an example of a language training system that performs a language training method according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the internal configuration of a user terminal and a server that perform a language training method according to an embodiment of the present invention.
FIG. 3 is a block diagram illustrating the configuration and operation of a language training apparatus according to an embodiment of the present invention.
FIG. 4 is a flowchart of a language training method according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating an example of training data according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating a method of recognizing a word using a speech recognition model according to an embodiment of the present invention.
FIGS. 7 to 14 are exemplary views of an interface of a user terminal that performs a language training method according to an embodiment of the present invention.
FIGS. 15 and 16 are exemplary views of a management page for language training results according to an embodiment of the present invention.

The following detailed description of the present invention refers to the accompanying drawings, which show, by way of illustration, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the invention differ from one another but need not be mutually exclusive. For example, specific shapes, structures, and characteristics described herein may be changed from one embodiment to another without departing from the spirit and scope of the invention. It should also be understood that the position or arrangement of individual components within each embodiment may be changed without departing from the spirit and scope of the invention. Accordingly, the following detailed description is not to be taken in a limiting sense, and the scope of the invention should be taken to encompass the scope of the claims and all equivalents thereto. In the drawings, like reference numerals denote the same or similar elements throughout the several aspects.

Hereinafter, various embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice it.

Hereinafter, a language training system that performs a language training method according to an embodiment of the present invention is described in detail with reference to FIG. 1.

The language training method according to an embodiment of the present invention may be performed in the user terminals 100a, 100b, and 100c. In some embodiments it may be performed in the server 110, or at least one step may be performed in the user terminals 100a, 100b, and 100c and at least one step in the server 110. In this case, the user terminals 100a, 100b, and 100c and the server 110 may communicate through the network 10.

According to some embodiments, the user terminals 100a, 100b, and 100c that perform the language training method may acquire the user's voice data through a microphone module, and the server 110 may acquire the user's voice data through the network 10. The user terminal 100a, 100b, or 100c or the server 110 that has acquired the user's voice data may recognize it using a speech recognition model.

In an embodiment, when no previous training record of the user exists, the user's semantic typicality values for a plurality of training data sets may be measured. In an optional embodiment, when a previous training record of the user exists, a target training data set may be generated based on the user's semantic typicality values.

The user terminal 100a, 100b, or 100c or the server 110 may then update the user's semantic typicality value for the target training data set based on the result of the user's language training with the target training data set. The speech recognition model used by the speech recognition device may then be updated using the semantic typicality values of a plurality of users for the plurality of training data sets.

In an embodiment, the user terminals 100a and 100b that include a display may further provide a user interface while language training is performed with a training data set. The user interface is described in detail below with reference to FIGS. 7 to 16.

Hereinafter, the internal configuration of a language training apparatus according to an embodiment of the present invention is described in detail with reference to FIG. 2. The user terminal 100 and the server 110 are assumed to be examples of the language training apparatus. In this case, the language training method according to one embodiment may be performed in the user terminal 100, the language training method according to another embodiment may be performed in the server 110, and the language training method according to yet another embodiment may be performed by the user terminal 100 and the server 110 together.

Hereinafter, the internal configuration of the user terminal 100 and the server 110 is described in detail as an example of a language training apparatus according to an embodiment of the present invention.

In an embodiment, the user terminal 100 that performs the language training method may include a memory 101, a processor 102, a communication module 103, an input/output interface 104, and a microphone module 105. Also in an embodiment, the server 110 that performs the language training method may include a memory 111, a processor 112, a communication module 113, and an input/output interface 114.

The memories 101 and 111 are computer-readable recording media and may include random access memory (RAM), read only memory (ROM), and a permanent mass storage device such as a disk drive. The memory 101 may temporarily or permanently store program code and settings for controlling the user terminal 100, a speech recognition model, user information, and training data. In another embodiment, the speech recognition model, user information, and training data may of course be stored in the memory 111 of the server 110.

The processors 102 and 112 may be configured to process the instructions of a computer program by performing basic arithmetic, logic, and input/output operations. In an embodiment, the instructions of the computer program may be provided to the processor 102 by the memory 101 or the communication module 103. For example, the processor 102 may be configured to execute instructions received according to program code stored in a recording device such as the memory 101.

The communication module 103 may provide a function for communicating with the server 110 through the network 10. For example, a request generated by the processor 102 of the user terminal 100 according to program code stored in a recording device such as the memory 101 may be transmitted to the server 110 through the network 10 under the control of the communication module 103. Conversely, control signals, commands, content, files, and the like provided under the control of the processor 112 of the server 110 may be received by the user terminal 100 through the communication module 103 via the network 10. For example, control signals or commands of the server 110 received through the communication module 103 may be transferred to the processor 102 or the memory 101, and content or files may be stored in a storage medium that the user terminal 100 may further include. The communication module 103 may also communicate with the server 110 through the network 10. The communication method is not limited, but the network 10 may be a short-range wireless network. For example, the network 10 may be a Bluetooth, Bluetooth Low Energy (BLE), or Wi-Fi network.

The input/output interface 104 may receive the user's input and display output data. The input/output interface 104 according to an embodiment may receive user information or feedback on language training content from the user, and may show language training content that uses the training data on a display.

The microphone module 105 may be an interface for connecting to a microphone device. In an embodiment, user voice data acquired by the microphone module 105 of the user terminal 100 may be transferred to the memory 101.

In other embodiments, the user terminal 100 and the server 110 may include more components than those shown in FIG. 2. However, there is no need to explicitly illustrate most prior-art components. For example, the user terminal 100 may include a battery and a charging device that supply power to its internal components, may be implemented to include at least some of the input/output devices described above, or may further include other components such as a transceiver, a Global Positioning System (GPS) module, various sensors, and a database.

In addition, although not shown in FIG. 2, when the server 110 performs the language training method according to an embodiment, the server 110 may of course further include the microphone module 105.

Hereinafter, with reference to FIG. 3, the hardware operation of the server is described for the case where the language training method is performed in the server. In the following, the server shown in FIG. 3 is described as the language training apparatus 110, taking the server as an example of a language training apparatus.

In an embodiment, the language training apparatus 110 may include a speech recognition unit 120 and a language training unit 130.

The speech recognition unit 120 may include a language-specific speech DB 121, a speech DB 122, a baseline speech recognition model 123, and a language-specific speech recognition model 124.

In an embodiment, the speech DB 122 may store data that includes the user's speech (non-disordered speech) and disordered speech. The baseline speech recognition model 123, which obtains data from the speech DB 122, may therefore perform a likelihood comparison between the user's speech and disordered speech. In another embodiment, the baseline speech recognition model may include a model for recognizing the user's speech and a model for extracting disordered speech.

Meanwhile, because the language used may differ from user to user, the speech recognition unit 120 included in the language training apparatus 110 according to some embodiments of the present invention may further include the language-specific speech DB 121. The language-specific speech DB 121 according to an embodiment may store voice data that reflects the speech characteristics of users of each language. The speech recognition apparatus 110 may therefore adapt the baseline speech recognition model 123 using the language-specific speech DB 121. Accordingly, in an embodiment of the present invention, the language-specific speech recognition model 124 may be generated based on the baseline speech recognition model 123. However, when a different speech recognition model is generated for each language, excessive development cost is required to develop a speech recognition model for each of the plurality of languages. The language training apparatus 110 according to another embodiment may therefore include a language training unit 130 corresponding to each of the plurality of languages.

The language training unit 130 may include an interface 131, a preprocessing unit 132, a training data DB 133, a phoneme conversion unit 134, a phoneme separation unit 135, and an utterance evaluation unit 136. According to some embodiments, each language training unit 130 generated for a language may receive the user's voice, decode the voice data, and then evaluate the user's language ability. The user's language ability may be evaluated based on the semantic typicality value described later. Generating a decoder (for example, the language training unit 130 according to this embodiment) for each of the plurality of languages, instead of generating a speech recognition model for each, saves the cost of generating language-specific speech recognition models. In this embodiment, the language training unit 130 is a decoder for performing language training and may be used in an Extended Recognition Network (ERN) scheme.

The language training apparatus 110 may acquire, through the interface 131, the voice data of the user who is to perform language training. The language training apparatus 110 may then remove noise from the voice data in the preprocessing unit 132 and extract only the content of the user's utterance.

The training data DB 133 may store training data sets for the user's language training. A training data set may include one or more data pairs, each consisting of at least one reference word and a plurality of comparison words. The one or more data pairs included in a training data set may include words belonging to a single category. The phoneme conversion unit 134 may perform grapheme-to-phoneme (G2P) conversion on the words included in the training data set.
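The layout of a training data set described above can be sketched as a small Python data structure. The class and field names are hypothetical, and the sketch simplifies "at least one reference word" to exactly one reference word per data pair; the example words are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataPair:
    """One reference word together with its comparison words."""
    reference_word: str
    comparison_words: tuple


@dataclass
class TrainingDataSet:
    """Data pairs whose words all belong to a single category."""
    category: str
    pairs: list = field(default_factory=list)

    def add_pair(self, reference_word, comparison_words):
        # The text calls for a plurality of comparison words.
        if len(comparison_words) < 2:
            raise ValueError("a data pair needs at least two comparison words")
        self.pairs.append(DataPair(reference_word, tuple(comparison_words)))

    def vocabulary(self):
        """All distinct words in the set, e.g. as input to G2P conversion."""
        words = set()
        for pair in self.pairs:
            words.add(pair.reference_word)
            words.update(pair.comparison_words)
        return sorted(words)
```

A set for a hypothetical "fruit" category could be built with `ds = TrainingDataSet("fruit")` followed by `ds.add_pair("apple", ["pear", "durian"])`, after which `ds.vocabulary()` lists the words that a phoneme conversion step would process.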

The phoneme separation unit 135 may perform phoneme separation on the content of the user's utterance extracted from the voice data. The phoneme separation unit 135 may also perform phoneme separation on the phoneme-converted training data.

The utterance evaluation unit 136 may then evaluate the result of the user's language training by comparing the training data with the user's voice data. More specifically, in an embodiment, when no previous training record of the user exists, the utterance evaluation unit 136 may measure the user's semantic typicality value for the words included in the training data DB 133. Semantic typicality is an indicator of the extent to which an exemplar belonging to a semantic category has the typical semantic features that represent that category (Kiran, 2008; Kiran & Thompson, 2003). That is, the more frequently the user uses a specific word and the more familiar the word is to the user, the higher the user's semantic typicality for that word.

In an embodiment, when a previous training record of the user exists, the utterance evaluation unit 136 may evaluate the result of the user's language training on the target training data and may update the user's semantic typicality value for the target training data based on that result. The utterance evaluation unit 136 may then normalize the user's language training results and store and manage them. The language training apparatus 110 according to an embodiment of the present invention may select target training data from among the plurality of training data sets in ascending order of the user's semantic typicality. The method of measuring the user's semantic typicality value for the target training data is described later.
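Selecting target training data in ascending order of semantic typicality can be sketched in Python as follows. The function name `order_targets` and the dictionary layout are assumptions for illustration, not part of the described apparatus.

```python
def order_targets(typicality_by_set, limit=None):
    """Return data set names ordered from lowest to highest typicality.

    typicality_by_set: dict mapping a data set name to the user's
        semantic typicality value for that set (hypothetical layout).
    limit: optional number of sets to keep from the front of the order.
    """
    # sorted() is stable, so sets with equal values keep insertion order.
    ranked = sorted(typicality_by_set, key=typicality_by_set.get)
    return ranked if limit is None else ranked[:limit]
```

Called with `{"fruit": 0.7, "tools": 0.2, "animals": 0.5}`, the function returns the set names with the least typical set first, so training starts from the material the user finds least familiar.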

In another embodiment, the language training unit 130 may extract the content of the user's utterance from the voice data by using the speech recognition model included in the speech recognition unit 120. The speech recognition model included in the speech recognition unit 120 may also be updated based on the semantic typicality values of a plurality of users measured by the language training unit 130. More specifically, the speech recognition model may be trained by assigning a higher weight to the voice data of a user with high semantic typicality for a specific word than to the voice data of a user with low semantic typicality for that word. This is because a user with high semantic typicality for a specific word is more likely to recognize and utter that word accurately.
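The weighting idea in this paragraph can be sketched as follows. This is only an illustration under stated assumptions: the patent does not specify a weighting formula, so the proportional rule, the `floor` parameter, and the function names `sample_weights` and `weighted_loss` are hypothetical, and the score is assumed to be one where a larger number means higher typicality.

```python
def sample_weights(typicality, floor=0.1):
    """Turn per-utterance typicality scores into training weights.

    Assumption: weight is proportional to the speaker's typicality score
    for the uttered word, clipped below at `floor`, then rescaled so the
    mean weight is 1.0.
    """
    raw = [max(t, floor) for t in typicality]
    total = sum(raw)
    return [w * len(raw) / total for w in raw]


def weighted_loss(losses, weights):
    """Weighted average of per-utterance losses."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)


# Two utterances of the same word: one speaker with high typicality (0.9),
# one with low typicality (0.3). The first gets the larger weight.
weights = sample_weights([0.9, 0.3])
loss = weighted_loss([1.0, 3.0], weights)
```

With weights like these, utterances from speakers who know the word well contribute more to the model update than utterances from speakers who do not.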

FIG. 4 is a flowchart of a language training method according to an embodiment of the present invention.

In step S110, information on the user performing the language training may be obtained.

In step S120, whether a previous training record of the user exists may be checked.

In one embodiment, when no previous training record of the user exists, the user's semantic typicality values for a plurality of training data sets may be measured in step S130.

More specifically, the language training apparatus may measure the user's semantic typicality value for a comparison word selected by the user, based on the user's selection of one of a plurality of comparison words included in a target data pair.

In one embodiment, the semantic typicality value of a specific word may be determined based on correlation coefficients between a plurality of data pairs included in a plurality of training data sets. More specifically, correlation coefficients between the training data sets may be measured from the language training results of a plurality of users. Accordingly, once the user's semantic typicality value for a first training data set has been measured, the user's semantic typicality value for a second training data set can be estimated using the correlation coefficient between the first and second training data sets. In this way, the user's expected semantic typicality value for a particular training data set can be estimated without evaluating the user on every training data set. In one embodiment, a training data set may include one or more data pairs, each consisting of at least one reference word and a plurality of comparison words. The one or more data pairs included in a training data set may contain words belonging to a single category.
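One way to picture the correlation-based estimate is a simple linear prediction from other users' scores. The sketch below is an assumption, not the patent's stated method: it uses the Pearson correlation coefficient and a least-squares line, and the names `pearson` and `estimate_second_set` are hypothetical.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        raise ValueError("scores must vary to compute a correlation")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (sx * sy)


def estimate_second_set(score_a, scores_a, scores_b):
    """Estimate a user's score on set B from their measured score on set A.

    scores_a / scores_b hold other users' scores on the two sets, in the
    same user order. The estimate follows the regression line whose slope
    is r * (sd_b / sd_a).
    """
    r = pearson(scores_a, scores_b)
    slope = r * pstdev(scores_b) / pstdev(scores_a)
    return mean(scores_b) + slope * (score_a - mean(scores_a))


# Other users' scores on two training data sets (made-up numbers).
others_a = [0.2, 0.4, 0.6, 0.8]
others_b = [0.3, 0.5, 0.7, 0.9]
predicted_b = estimate_second_set(0.5, others_a, others_b)
```

The closer the correlation is to zero, the more the estimate falls back to the population mean for the second set, which matches the intuition that weakly related sets say little about each other.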

In an optional embodiment, when a previous training record of the user exists, a target training data set may be generated in step S140 based on the user's semantic typicality values.

For example, when a first target training data set and a second target training data set are generated, the deviation between the user's semantic typicality value for the first target training data set and the user's semantic typicality value for the second target training data set may be a specified value. As another example, the average of the user's semantic typicality value for the first target training data set and the user's semantic typicality value for the second target training data set may be a specified value. In this way, the language training apparatus according to some embodiments of the present invention may perform the user's language training using various training data sets according to specified criteria.
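The deviation criterion can be illustrated with a small search over candidate sets. This is a sketch under assumptions: the patent does not say how the pair is chosen, so the exhaustive search, the tolerance `tol`, and the function name `pick_pair` are hypothetical, and the scores are made-up numbers.

```python
from itertools import combinations


def pick_pair(scores, target_gap, tol=0.05):
    """Pick two training data sets whose score gap is closest to target_gap.

    scores maps a set name to the user's semantic typicality value.
    Returns a (name, name) tuple, or None if no pair is within `tol`.
    """
    best = None
    for a, b in combinations(sorted(scores), 2):
        err = abs(abs(scores[a] - scores[b]) - target_gap)
        if err <= tol and (best is None or err < best[0]):
            best = (err, a, b)
    return None if best is None else (best[1], best[2])


user_scores = {"animals": 0.8, "fruits": 0.5, "tools": 0.3}
pair = pick_pair(user_scores, target_gap=0.3)
```

The average criterion from the same paragraph could be handled the same way by replacing the gap expression with `(scores[a] + scores[b]) / 2`.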

In yet another embodiment, the language training apparatus may perform language training on the plurality of target training data sets in order from the lowest semantic typicality for the user. Performing language training with the training data sets ordered from the lowest semantic typicality is more beneficial for improving the user's language ability than performing it in order from the highest.
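The ordering described here amounts to a sort. The sketch below assumes a score where a larger number means higher typicality; the claims describe a value that increases as typicality decreases, in which case the sort direction would flip. The function name `order_by_typicality` is hypothetical.

```python
def order_by_typicality(scores):
    """Return training data set names from lowest to highest typicality.

    scores maps a set name to the user's typicality score (assumed:
    larger means more typical). Ties are broken by name so the order
    is deterministic.
    """
    return sorted(scores, key=lambda name: (scores[name], name))


schedule = order_by_typicality({"animals": 0.8, "tools": 0.3, "fruits": 0.5})
```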

Thereafter, in step S150, the user's semantic typicality value for the target training data set may be updated based on the result of the user's language training using the target training data set.

In one embodiment, the user's semantic typicality value may be updated based on the user's language training result using the first target training data set, and then updated based on the user's language training result using the second target training data set. As described above, performing language training with the training data sets in order from the lowest semantic typicality is beneficial for improving the user's language ability; note, accordingly, that the user's semantic typicality value for the first target training data set is greater than the user's semantic typicality value for the second target training data set.

In step S160, the semantic typicality values of a plurality of users for the plurality of training data sets may be obtained.

In step S170, the speech recognition model may be updated based on the semantic typicality values of the plurality of users for the plurality of training data sets. In one embodiment, the speech recognition model may include a language-specific speech recognition decoder for each of a plurality of languages. In one embodiment, the language training apparatus may train the speech recognition model by assigning a higher weight to the voice data of a user with high semantic typicality for a specific word than to the voice data of a user with low semantic typicality for that word.

Hereinafter, a language training protocol generated using a plurality of training data sets in an embodiment of the present invention is described in detail with reference to FIG. 5.

In one embodiment, the language training apparatus may generate a language training protocol 200 that includes a plurality of language training contents. Each language training content may be generated using at least one training data set.

In one embodiment, the training data sets used in the plurality of language training contents may differ. For example, referring to FIG. 5, the training data sets used for content 1, "Picture-based noun naming," and content 4, "Verb naming," may differ from each other. In this case, the deviation between the user's expected semantic typicality value for the first training data set used for content 1 and the user's expected semantic typicality value for the second training data set used for content 2 may be a specified value. In another example, the average of the user's expected semantic typicality value for the first training data set used for content 1 and the user's expected semantic typicality value for the second training data set used for content 2 may, of course, be a specified value.

In another embodiment, the training data sets used in the plurality of language training contents may be the same. For example, referring to FIG. 5, the training data set used for content 2, "Picture-based semantic category classification," and content 3, "Picture-based semantic feature selection," may be the same. In this case, the user's semantic typicality value for the training data set used in contents 2 and 3 can be measured more accurately based on the results of performing both content 2 and content 3.
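A simple way to see why two contents sharing one training data set give a more accurate measurement is to pool their results. The sketch below is an assumption: the patent does not say how the two results are combined, and the pooled-accuracy rule and the name `pooled_estimate` are hypothetical.

```python
def pooled_estimate(results):
    """Combine results from several contents that share one training data set.

    results is a list of (correct, attempts) tuples, one per content.
    Returns the pooled proportion of correct responses, which rests on
    more observations than either content alone.
    """
    correct = sum(c for c, _ in results)
    attempts = sum(n for _, n in results)
    if attempts == 0:
        raise ValueError("no attempts recorded")
    return correct / attempts


# Content 2: 7 of 10 correct; content 3: 9 of 10 correct (made-up numbers).
score = pooled_estimate([(7, 10), (9, 10)])
```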

FIG. 6 is a diagram for describing in detail how a speech recognition model according to an embodiment of the present invention recognizes words.

The speech recognition model according to some embodiments of the present invention may perform real-time speech recognition based on an Extended Recognition Network (ERN) in order to detect the user's utterance accurately. FIG. 6 shows examples of typicality words and error patterns used to find articulation (GoP, Goodness of Pronunciation) problems. Because the speech recognition model according to an embodiment can use a language-specific speech DB and perform language-specific model adaptation, error patterns that reflect the user's utterance characteristics for each language can be defined. The speech recognition model according to an embodiment may also include separate models for comparing the likelihoods of non-disordered speech (the user's utterance) and disordered speech. In addition, the speech recognition model may include a GoP decoder for measuring a confidence score.
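The likelihood comparison behind a GoP-style confidence score can be sketched as a frame-normalized log-likelihood difference between the expected pronunciation and competing error patterns. This is an illustration only: the patent does not give a formula, so the scoring rule, the threshold value, and the names `gop_score` and `is_mispronounced` are assumptions, and the log-likelihoods are made-up numbers rather than output from a real decoder.

```python
def gop_score(loglik_canonical, loglik_error_patterns, num_frames):
    """Frame-normalized log-likelihood gap for one word or phone segment.

    Returns 0.0 when the canonical pronunciation is the best-scoring
    hypothesis, and a negative value when an error pattern fits the
    audio better.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    best = max([loglik_canonical, *loglik_error_patterns])
    return (loglik_canonical - best) / num_frames


def is_mispronounced(score, threshold=-0.5):
    """Flag a segment when its score falls below a hypothetical threshold."""
    return score < threshold


# Canonical form wins: score 0.0, not flagged.
ok = gop_score(-120.0, [-150.0, -135.0], num_frames=30)
# An error pattern fits better by 30 log-likelihood units over 30 frames.
bad = gop_score(-150.0, [-120.0], num_frames=30)
```

In a network that lists both the expected form and its likely error patterns, a score like this indicates whether the recognizer preferred the expected form and by how much.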

Hereinafter, the interface of the user terminal 100 that performs a language training method according to an embodiment of the present invention is described in detail with reference to FIGS. 7 to 16.

First, as shown in FIG. 7, the user terminal 100 performing the language training method according to an embodiment of the present invention may display information on the types of the language training contents included in a language training protocol and on the order in which they are performed. Thereafter, the user terminal 100 may execute the language training contents sequentially in response to an input (for example, the user's voice data) from a user who wishes to perform the language training protocol.

According to one embodiment, the language training contents run on the user terminal 100 may be executed in response to the user's voice data. Each of the language training contents may also train other cognitive abilities in addition to the user's language ability. FIGS. 8 to 13 show examples of interfaces of language training contents for training the user's perceptual ability as well as the user's language ability. The content shown in FIG. 8 has the user play rock-paper-scissors by voice, within a time limit, according to an instruction shown on the screen (for example, "Say the one that wins"). The content shown in FIG. 9 has the user judge whether the color of each block displayed on the screen matches the meaning of the word displayed on that block, and utter the word shown on a mismatched block. The content shown in FIG. 10 has the user recognize a moving UI object and utter the number displayed on it. The content shown in FIG. 11 has the user check the direction in which an arrow displayed on the screen moves and utter that direction. The content shown in FIG. 12 has the user recognize and utter characters that are displayed flipped vertically or horizontally. The content shown in FIG. 13 has the user utter the word that fills the blank in a sentence displayed on the screen. The user terminal 100 displaying the contents shown in FIGS. 8 to 13 may evaluate the user's language ability by analyzing the user's voice data input during the time limit, and may evaluate the user's perceptual ability based on the content of the user's utterances for each content. In this case, all of the language training contents shown in FIGS. 8 to 13 are, of course, performed through the user's voice data. Note, however, that these are only examples, and the types of language training content in the language training method according to some embodiments of the present invention are not limited to them. Thereafter, as shown in FIG. 14, the user terminal 100 according to an embodiment may display a result screen for the content to the user based on the results of performing the specified language training protocol.

Meanwhile, when the server 110 manages the user's language training results according to an embodiment of the present invention, the server 110 may receive the user's language training result data from the user terminal 100. In this case, the server 110 may store and manage cognitive ability information for each user, and may display, in chart form as shown in FIG. 15, the change in cognitive ability resulting from the language training contents the user has performed. In addition, all information related to the type, order, and difficulty of the language training contents included in the language training protocol may be stored and managed and provided to the user as shown in FIG. 16. Note that the interfaces shown in FIGS. 7 to 16 are examples. Operations performed by the server 110 may also be performed on the user terminal, and an interface displayed on the screen of the server 110 may also be displayed on the screen of the user terminal 100. Likewise, operations performed by the user terminal 100 may be performed by the server 110, and an interface displayed on the screen of the user terminal 100 may, of course, be displayed on the screen of the server 110.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described as being used, but a person of ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions executable through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in a different order from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

1. A language training method for a user, performed by a speech recognition device, the method comprising:
measuring semantic typicality values of the user for a plurality of training data sets when no previous training record of the user exists;
generating a target training data set based on the semantic typicality values of the user when a previous training record of the user exists;
updating the semantic typicality value of the user for the target training data set based on a result of the user's language training using the target training data set; and
updating a speech recognition model used by the speech recognition device by using semantic typicality values of a plurality of users for the plurality of training data sets.

2. The method of claim 1,
wherein each training data set includes one or more data pairs, each consisting of at least one reference word and a plurality of comparison words, and
wherein measuring the semantic typicality values of the user comprises measuring, based on the user's selection of one of a plurality of comparison words included in a target data pair, the user's semantic typicality value for the comparison word selected by the user.

3. The method of claim 2,
wherein the one or more data pairs included in the training data set contain words belonging to a single category, and
wherein measuring the semantic typicality values further comprises measuring correlation coefficients between a plurality of data pairs included in the plurality of training data sets.

4. The method of claim 1,
wherein generating the target training data set comprises generating a first target training data set and a second target training data set, and
wherein the deviation between the user's semantic typicality value for the first target training data set and the user's semantic typicality value for the second target training data set is a specified value.

5. The method of claim 4,
wherein the average of the user's semantic typicality value for the first target training data set and the user's semantic typicality value for the second target training data set is a specified value.

6. The method of claim 4,
wherein updating the semantic typicality value of the user comprises updating the user's semantic typicality value based on the user's language training result using the first target training data set, and then updating the user's semantic typicality value based on the user's language training result using the second target training data set,
wherein the user's semantic typicality value for the first target training data set is greater than the user's semantic typicality value for the second target training data set, and
wherein the semantic typicality value is a number that increases as the typicality among the words included in the target training data set decreases.

7. The method of claim 1,
wherein the speech recognition model includes a language-specific speech recognition decoder for each of a plurality of languages, and
wherein updating the speech recognition model comprises updating the speech recognition decoder corresponding to the language used by the user.

8. A language training device comprising:
an input/output interface for receiving a user's voice; and
a processor,
wherein the processor measures semantic typicality values of the user for a plurality of training data sets when no previous training record of the user exists, generates a target training data set based on the semantic typicality values of the user when a previous training record of the user exists, updates the semantic typicality value of the user for the target training data set based on a result of the user's language training using the target training data set, and updates a speech recognition model by using semantic typicality values of a plurality of users for the plurality of training data sets.