KR20210131698A

KR20210131698A - Method and apparatus for teaching foreign language pronunciation using articulator image

Info

Publication number: KR20210131698A
Application number: KR1020200050108A
Authority: KR
Inventors: 손단영; 윤용욱; 장두성
Original assignee: 주식회사 케이티
Priority date: 2020-04-24
Filing date: 2020-04-24
Publication date: 2021-11-03

Abstract

The present invention relates to a method and apparatus for teaching foreign language pronunciation using images of the pronunciation organs, which improves convenience for the learner. The method is performed by a computing device operated by at least one processor and comprises: receiving a learner's pronunciation voice and image for an arbitrary text; generating a standard pronunciation phoneme list for the text; generating a learner pronunciation phoneme list by extracting, from the learner's pronunciation voice and image, the phonemes with which the text was pronounced; and inputting the standard pronunciation phoneme list into a trained image generation model to generate a standard pronunciation simulation image that pronounces the standard pronunciation phoneme list, inputting the learner's pronunciation voice and image together with the learner pronunciation phoneme list into the image generation model to generate a learner pronunciation simulation image that pronounces the learner pronunciation phoneme list, and outputting the standard pronunciation simulation image and the learner pronunciation simulation image, wherein both simulation images are images of virtual pronunciation organs simulating the pronunciation process.

Description

A method and apparatus for teaching foreign language pronunciation using an image of a pronunciation organ

The present invention relates to a technique for teaching foreign language pronunciation using images of the pronunciation organs.

In the global era, the ability to communicate in a foreign language is becoming very important, and pronunciation plays a key role in smooth communication. In Korean foreign language education, however, exam-oriented teaching places more weight on grammar and vocabulary than on pronunciation, and the pronunciation teaching that does exist still relies mainly on presenting pronunciation in segmented form.

Specifically, traditional pronunciation teaching for foreign languages such as English presents the learner with a native speaker's pronunciation, segmented and shown on screen as text or images, prompts the learner to imitate it, and explains it in writing.

For example, a pronunciation organ image is generated in 3D according to predetermined rules and output, or an existing 3D image is simply modified by adding some speech features to it.

Visualizing the pronunciation organs in 3D can give the learner visual information about a specific pronunciation by showing the position and shape of the tongue, the shape of the lips, and so on during speech as easily understood pictures or videos. With this information alone, however, it is hard for learners to know how they themselves pronounce a sound or how to correct it to match a native speaker's pronunciation, because they cannot see how their own pronunciation organs would be represented in 3D when they speak.

That is, a learning method is needed that outputs an image of the learner's pronunciation organs on screen, lets the learner directly see how it differs from the image of a native speaker's pronunciation organs serving as the standard, and shows through images how the learner should correct the pronunciation.

One problem to be solved is to provide a method of representing, as a three-dimensional image generated by a trained deep learning model, the shape and movement of the pronunciation organs when an arbitrary sentence is pronounced.

Another problem to be solved is to show the learner's and a native speaker's pronunciation organs moving at the same time for the same text, thereby revealing the differences between them and providing a pronunciation guide toward the standard pronunciation.

A method according to an embodiment is performed by a computing device operated by at least one processor and comprises: receiving a learner's pronunciation voice and image for an arbitrary text; generating a standard pronunciation phoneme list for the text; generating a learner pronunciation phoneme list by extracting, from the learner's pronunciation voice and image, the phonemes with which the text was pronounced; and inputting the standard pronunciation phoneme list into a trained image generation model to generate a standard pronunciation simulation image that pronounces the standard pronunciation phoneme list, inputting the learner's pronunciation voice and image together with the learner pronunciation phoneme list into the image generation model to generate a learner pronunciation simulation image that pronounces the learner pronunciation phoneme list, and outputting the standard pronunciation simulation image and the learner pronunciation simulation image, wherein both simulation images are images of virtual pronunciation organs simulating the pronunciation process.
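The order of the steps in this embodiment can be sketched as a small pipeline. This is only an illustration under stated assumptions: `SimulationImage`, `run_lesson`, and the three `toy_*` callables are hypothetical names that do not appear in the specification, and the toy functions stand in for the trained models.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Phonemes = List[str]


@dataclass
class SimulationImage:
    """Stand-in for a generated clip of virtual pronunciation organs."""
    label: str           # "standard" or "learner"
    phonemes: Phonemes   # phonemes the clip pronounces, in order


def run_lesson(
    text: str,
    learner_av: bytes,
    standard_phonemes: Callable[[str], Phonemes],
    learner_phonemes: Callable[[bytes, str], Phonemes],
    generate: Callable[[str, Phonemes, Optional[bytes]], SimulationImage],
) -> Tuple[SimulationImage, SimulationImage]:
    # Step 1: the learner's voice and image (learner_av) were received by the caller.
    # Step 2: standard pronunciation phoneme list for the text.
    standard = standard_phonemes(text)
    # Step 3: phonemes the learner actually produced.
    learner = learner_phonemes(learner_av, text)
    # Step 4: the standard clip needs only the phoneme list; the learner
    # clip is additionally conditioned on the learner's own recording.
    standard_img = generate("standard", standard, None)
    learner_img = generate("learner", learner, learner_av)
    return standard_img, learner_img


# Toy stand-ins (assumptions) so the sketch runs end to end.
def toy_standard(text: str) -> Phonemes:
    return ["æ", "p", "l"]


def toy_learner(av: bytes, text: str) -> Phonemes:
    return ["e", "p", "l"]


def toy_generate(label: str, phonemes: Phonemes, av: Optional[bytes]) -> SimulationImage:
    return SimulationImage(label, list(phonemes))


std_img, lrn_img = run_lesson("apple", b"", toy_standard, toy_learner, toy_generate)
print(std_img)
print(lrn_img)
```

Passing the models in as callables keeps the step order visible without committing to any particular model architecture.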

The method may further include analyzing differences in the shape or movement of each pronunciation organ between the standard pronunciation simulation image and the learner pronunciation simulation image, and generating a pronunciation guide that includes those differences.

The method may further include: collecting a first image showing the shape of the oral cavity for each phoneme and a second image showing the shape of a cross-section of the pronunciation organs for each phoneme; generating training data by tagging the first image or the second image with the corresponding phoneme, speaker information including the speaker's gender and age group, and the shapes of the oral cavity and the pronunciation organs when the phoneme is pronounced; and training the image generation model, using the training data, on the relationship between each phoneme and the shapes of the oral cavity and of the pronunciation organ cross-section.

The shape of the oral cavity may include at least one of the size of the oral cavity, the shape of the lips, whether the lower and upper teeth touch each other, or the position of the tongue.

The phoneme may be expressed as a phonetic symbol, and the phonetic symbol may be an International Phonetic Alphabet (IPA) symbol modified according to the speaker information.

The training data may further include a numerical value representing the difference between the phonetic symbol modified according to the speaker information and the International Phonetic Alphabet symbol.
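One way to picture a tagged training record is the sketch below. The field names, the `OralShape` and `TrainingSample` classes, and in particular the choice of a Euclidean distance between two tongue positions as the "numerical difference" are assumptions made for illustration; the specification does not say how that value is computed.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Position = Tuple[float, float]  # hypothetical (backness, height) coordinates


@dataclass(frozen=True)
class OralShape:
    cavity_size: float        # size of the oral cavity (arbitrary units)
    lip_shape: str            # e.g. "rounded" or "spread"
    teeth_touching: bool      # whether lower and upper teeth touch
    tongue_position: Position


@dataclass(frozen=True)
class TrainingSample:
    image_path: str
    standard_symbol: str      # standard IPA symbol
    modified_symbol: str      # symbol modified according to speaker information
    gender: str
    age_group: str
    oral_shape: OralShape
    symbol_difference: float  # numerical gap between modified and standard symbol


def symbol_difference(standard_pos: Position, modified_pos: Position) -> float:
    """Assumed metric: straight-line distance between the two tongue positions."""
    return math.hypot(modified_pos[0] - standard_pos[0],
                      modified_pos[1] - standard_pos[1])


sample = TrainingSample(
    image_path="clips/0001.png",  # hypothetical path
    standard_symbol="α",
    modified_symbol="αː'",
    gender="F",
    age_group="20s",
    oral_shape=OralShape(1.0, "spread", False, (3.0, 4.0)),
    symbol_difference=symbol_difference((0.0, 0.0), (3.0, 4.0)),
)
print(sample.symbol_difference)
```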

A computing device according to an embodiment includes a memory and at least one processor that executes instructions of a program loaded in the memory, wherein the program includes instructions written to execute: collecting training images that include the shape of the oral cavity and the shape of the pronunciation organ cross-section for each phoneme, and generating training data by tagging the training images with the pronounced phoneme, the standard phoneme of the pronounced phoneme, and speaker information including the speaker's gender and age group; training, with the training data, an image generation model that outputs a virtual pronunciation organ simulation image in which the standard phoneme of an input phoneme is pronounced; inputting a learner's pronunciation voice and image for an arbitrary text into the image generation model; and displaying on a screen the standard pronunciation simulation image of the text output from the image generation model.

The training data may further include histogram information or feature point information extracted from the training images.
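As an illustration of what histogram information from a training image could look like, the following minimal sketch computes a normalized intensity histogram for a grayscale frame given as nested lists. The function name `gray_histogram` and the choice of an intensity histogram are assumptions; the specification does not state which histogram is used.

```python
from typing import List, Sequence


def gray_histogram(image: Sequence[Sequence[int]], bins: int = 4) -> List[float]:
    """Return the fraction of pixels falling into each of `bins` equal intensity ranges."""
    if bins <= 0 or 256 % bins != 0:
        raise ValueError("bins must be a positive divisor of 256")
    width = 256 // bins
    counts = [0] * bins
    total = 0
    for row in image:
        for value in row:
            if not 0 <= value <= 255:
                raise ValueError("pixel values must be in 0..255")
            counts[value // width] += 1
            total += 1
    if total == 0:
        return [0.0] * bins
    return [count / total for count in counts]


# A tiny 2x3 grayscale frame used only as example input.
frame = [[0, 10, 200],
         [255, 128, 64]]
hist = gray_histogram(frame, bins=4)
print(hist)
```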

The training step may compute, as probabilities, the relationships between phonemes that can be pronounced consecutively, and the displaying step may generate the standard phoneme list of the text and the pronounced phoneme list of the text based on those probabilities.
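A simple way to express "which phoneme can follow which, and how likely" is a bigram table estimated from phoneme sequences. The sketch below assumes plain relative-frequency counting; the specification does not state how the probabilities are obtained, and `transition_probabilities` is a hypothetical name.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def transition_probabilities(
    sequences: Iterable[List[str]],
) -> Dict[str, Dict[str, float]]:
    """Estimate P(next phoneme | previous phoneme) from observed sequences."""
    pair_counts: Dict[str, Counter] = defaultdict(Counter)
    for seq in sequences:
        # Count every adjacent (previous, next) pair in the sequence.
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[prev][nxt] += 1
    probs: Dict[str, Dict[str, float]] = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        probs[prev] = {nxt: n / total for nxt, n in counter.items()}
    return probs


# Toy corpus (assumption): two short phoneme sequences.
corpus = [["s", "a", "n", "a", "i"], ["s", "a", "i"]]
probs = transition_probabilities(corpus)
print(probs["a"])
```

In this toy corpus "a" is followed by "i" twice and by "n" once, so the table assigns them 2/3 and 1/3 respectively.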

The steps may further include generating a learner pronunciation simulation image in which the virtual pronunciation organs pronounce the text.

The steps may further include analyzing and providing the differences between the standard phoneme list and the pronounced phoneme list of the text.
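The comparison of the two phoneme lists can be sketched with a sequence alignment from Python's standard library. Using `difflib.SequenceMatcher` is an assumption made for illustration, not the analysis method of the specification, and `phoneme_differences` is a hypothetical helper.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Difference = Tuple[str, List[str], List[str]]


def phoneme_differences(standard: List[str], pronounced: List[str]) -> List[Difference]:
    """List the spans where the pronounced phonemes depart from the standard ones."""
    matcher = SequenceMatcher(a=standard, b=pronounced, autojunk=False)
    differences: List[Difference] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # tag is "replace", "delete", or "insert"
            differences.append((tag, standard[i1:i2], pronounced[j1:j2]))
    return differences


standard_list = ["æ", "p", "l"]
pronounced_list = ["e", "p", "ɯ", "l"]   # example learner output (assumption)
diffs = phoneme_differences(standard_list, pronounced_list)
for tag, expected, actual in diffs:
    print(tag, expected, actual)
```

Each reported span names the kind of departure (a substituted, missing, or extra phoneme), which is the information a learner-facing summary would need.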

According to the present invention, the learner can hear a native speaker's standard pronunciation of the text the learner pronounced and can also see, in a simulation image, how the pronunciation organs move for the standard pronunciation, which makes pronunciation teaching effective.

In addition, because the present invention provides an image of the learner's pronunciation organs alongside that of a native speaker, learners can see for themselves which parts need correcting and can work on their pronunciation alone, without a teacher's help, which improves learner convenience.

FIG. 1 is a block diagram of an apparatus for generating a pronunciation image according to an embodiment.
FIG. 2 is an explanatory diagram of a pronunciation image model and a phoneme analysis model according to an embodiment.
FIG. 3 is a flowchart of a process in which the apparatus for generating a pronunciation image trains the phoneme analysis model according to an embodiment.
FIG. 4 is an explanatory diagram showing the pronunciation positions of vowels according to an embodiment.
FIG. 5 is an exemplary diagram showing features of the pronunciation organs according to an embodiment.
FIG. 6 is an exemplary diagram of a method of obtaining training data according to an embodiment.
FIG. 7 is an exemplary diagram of a phoneme list learned by the phoneme analysis model according to an embodiment.
FIG. 8 is a flowchart of a method of operating the apparatus for generating a pronunciation image according to an embodiment.
FIG. 9 is an exemplary diagram of a phoneme list predicted by the phoneme analysis model according to an embodiment.
FIG. 10 is an explanatory diagram of a method by which the pronunciation image model generates a simulation image according to an embodiment.
FIGS. 11 and 12 are exemplary views of result screens output by the apparatus for generating a pronunciation image according to an embodiment.
FIG. 13 is a hardware configuration diagram of a computing device according to an embodiment.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art can readily implement them. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. To describe the invention clearly, parts irrelevant to the description are omitted from the drawings, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to "include" an element, this means that it may further include other elements rather than excluding them, unless stated otherwise. Terms such as "…unit", "…er", and "module" denote a unit that processes at least one function or operation, which may be implemented in hardware, software, or a combination of the two.

In this specification, a voice means a physical speech sound. Even for the same sentence, the voice can differ depending on the speaker.

The International Phonetic Alphabet (IPA) is a system of symbols devised by the International Phonetic Association as a consistent way of writing sounds so as to cover the varied sounds of the world's languages. For example, the pronunciation of the English word 'apple' is written in IPA as [ˈæpl], and the pronunciation of the Korean word '사나이' (man) is written as [sanai].

A phoneme is the basic unit of speech sound as it is abstractly perceived, and even the same phoneme can sound different depending on who pronounces it. Therefore, even for the same sentence, the standard pronunciation phonemes of the sentence and the phonemes of the sentence as pronounced by the learner may differ.

Accordingly, this specification aims to distinguish the differences between the two sets of phonemes and to present both the learner's pronunciation and the standard pronunciation as 3D images, helping the learner to study the standard pronunciation independently.

Unless otherwise stated in this specification, an utterance image is described as including the uttered voice.

FIG. 1 is a block diagram of an apparatus for generating a pronunciation image according to an embodiment.

Referring to FIG. 1, the pronunciation image generating apparatus 10 includes: a voice and image input unit 100 that receives an image of a learner pronouncing an arbitrary sentence; a learning unit 200 that trains a phoneme analysis model 210, which extracts the list of phonemes making up the input sentence, and a pronunciation image model 220; a pronunciation image generation unit 300 that generates a standard pronunciation image and a pronunciation organ image; a pronunciation guide providing unit 400 that analyzes and presents the differences between the standard pronunciation image and the learner's pronunciation image; a pronunciation image DB 500; and a phoneme information DB 600.

Although they are named the voice and image input unit 100, the learning unit 200, the pronunciation image generation unit 300, the pronunciation guide providing unit 400, the pronunciation image DB 500, and the phoneme information DB 600 for the purpose of description, these are computing devices operated by at least one processor. They may be implemented on a single computing device or distributed across separate computing devices; when distributed, they may communicate with one another through a communication interface. Any device capable of executing a software program written to carry out the present invention suffices as the computing device, for example a server or a laptop computer.

Each of the voice and image input unit 100, the learning unit 200, the pronunciation image generation unit 300, and the pronunciation guide providing unit 400 may be a single artificial intelligence model or may be implemented as a plurality of artificial intelligence models. Likewise, the phoneme analysis model 210 and the pronunciation image model 220 may each be a single artificial intelligence model or a plurality of models, and the pronunciation image generating apparatus 10 as a whole may be a single artificial intelligence model or a plurality of models. Accordingly, the one or more artificial intelligence models corresponding to the components above may be implemented by one or more computing devices.

The voice and image input unit 100 receives the voice and image of an arbitrary sentence pronounced by the learner. The input may be captured through a camera and a microphone, and front and side images of the learner may be obtained.

The learning unit 200 trains the phoneme analysis model 210 and the pronunciation image model 220 using the standard pronunciation phoneme information for each sentence contained in the pronunciation image DB 500 and the phoneme information DB 600. Specifically, the phoneme analysis model 210 extracts the list of phonemes making up a sentence from the voice and image in which the sentence is pronounced. When a phoneme list is input, the pronunciation image model 220 generates, as a cross-sectional view, the movement of the articulatory organs as each phoneme is pronounced.

Using the trained pronunciation image model 220, the pronunciation image generation unit 300 generates a 3D image that pronounces the sentence in the input learner image and a 3D standard pronunciation image that pronounces the same sentence.

The pronunciation guide providing unit 400 compares the learner's pronunciation image with the standard pronunciation image phoneme by phoneme and presents the learner with the differences in pronunciation position, mouth shape, tongue position, and so on for each phoneme. The pronunciation guide may be generated from the contents stored in the pronunciation image DB 500 and the phoneme information DB 600.
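The phoneme-by-phoneme comparison performed by the pronunciation guide providing unit can be pictured as follows. This is a sketch under assumptions: it supposes that each simulation image has already been reduced to one feature dictionary per phoneme and that the two lists are aligned; the feature names and `build_guide` are hypothetical.

```python
from typing import Dict, List

Features = Dict[str, object]  # e.g. {"lip_shape": "spread", "tongue_position": "front"}


def build_guide(
    phonemes: List[str],
    standard: List[Features],
    learner: List[Features],
) -> List[str]:
    """Return one note per feature that differs between learner and standard."""
    notes: List[str] = []
    for phoneme, std, lrn in zip(phonemes, standard, learner):
        for name in sorted(std):
            if lrn.get(name) != std[name]:
                notes.append(
                    f"[{phoneme}] {name}: yours is {lrn.get(name)!r}, "
                    f"standard is {std[name]!r}"
                )
    return notes


# Example data (assumption), one feature dictionary per phoneme.
phoneme_list = ["æ", "p"]
standard_features = [
    {"lip_shape": "spread", "tongue_position": "front-low"},
    {"lip_shape": "closed", "tongue_position": "rest"},
]
learner_features = [
    {"lip_shape": "spread", "tongue_position": "front-mid"},
    {"lip_shape": "closed", "tongue_position": "rest"},
]
guide = build_guide(phoneme_list, standard_features, learner_features)
for note in guide:
    print(note)
```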

The pronunciation guide may be provided through an interface such as those shown in FIGS. 12 and 13, and an image or text may be displayed for each phoneme for the learner's convenience. The learner's 3D pronunciation image and the standard pronunciation image may also be shown overlaid on each other.

The pronunciation image DB 500 contains images of the movement of the pronunciation organs during the pronunciation of phonemes, captured from outside the body, together with cross-sectional images of the inside of the body. The images stored in the pronunciation image DB 500 may be 2D or 3D, and may have been collected by web crawling or recorded for model training.

The images stored in the pronunciation image DB 500 may be of pronunciations by native speakers or by persons of equivalent ability, because they are used to generate the standard pronunciation image by transforming the sentence pronounced by the learner into the form in which a native speaker would pronounce it.

In other words, the images in the pronunciation image DB 500 can be regarded as the standard pronunciation images of the respective phonemes.

The phoneme information DB 600 contains the international standard IPA symbols that represent each phoneme as a phonetic symbol. The standard IPA symbols may be modified by an administrator to reflect the learner's pronunciation characteristics. Details are described with reference to FIG. 4.

FIG. 2 is an explanatory diagram of a pronunciation image model and a phoneme analysis model according to an embodiment.

Referring to FIG. 2, the phoneme analysis model 210 is trained to output the phonemes making up an arbitrary sentence, using training data that includes an image of a native speaker pronouncing a specific phoneme, the uttered sentence, the speaker's gender and age group, a cross-sectional image showing the movement of the pronunciation organs, and the features of the pronunciation organs when the phoneme is pronounced.

The information labeled on the pronunciation images and cross-sectional images may take the form of Table 2. The training data are described in detail with reference to FIG. 3.

When the learner's utterance image and uttered voice are input to the trained phoneme analysis model 210, the phonemes making up the voice may be output. The input is not limited to image and voice and may also be text.

In that case, when a text sentence is input to the trained phoneme analysis model 210 in order to generate a standard pronunciation image, the model may output the list of phonemes making up the text sentence.

The pronunciation image model 220 is trained, using the training data above, to output a 3D pronunciation image that imitates a cross-section of the human body.

When the learner's utterance image and uttered voice are input to the trained pronunciation image model 220, an image simulating the learner's pronunciation organs in 3D may be generated. The 3D pronunciation image can show the process of pronouncing the sentence uttered by the learner and the movement of the pronunciation organs.

To generate the image, the pronunciation image model 220 may additionally use the phoneme list output from the phoneme analysis model 210. The training process of each model is described in detail below.

FIG. 3 is a flowchart of a process in which the apparatus for generating a pronunciation image trains the phoneme analysis model according to an embodiment, FIG. 4 is an explanatory diagram showing the pronunciation positions of vowels according to an embodiment, FIG. 5 is an exemplary diagram showing features of the pronunciation organs according to an embodiment, FIG. 6 is an exemplary diagram of a method of obtaining training data according to an embodiment, and FIG. 7 is an exemplary diagram of a phoneme list learned by the phoneme analysis model according to an embodiment.

Referring to FIG. 3, the pronunciation image generating apparatus 10 collects pronunciation images and pronunciation organ cross-sectional images of foreigners or native speakers (S110). As one example, the images collected in the pronunciation image DB 500 may be used.

The pronunciation image DB 500 may not contain pronunciation images and pronunciation organ cross-sectional images for every phoneme. In that case, pronunciation images for other phonemes may be generated from the pronunciation images and cross-sectional images already collected and used as training data. As one example, a generative adversarial network (GAN) may be used.

The pronunciation image generating apparatus 10 collects phoneme information in which phonemes are expressed as phonetic symbols, and modifies it according to Korean pronunciation characteristics (S120). Phoneme information means the international standard IPA symbols, which may be used as they are or modified according to the learner's pronunciation characteristics.

The phoneme information is modified because Koreans mainly speak Korean, and the types and positions of the pronunciation organs mainly used in Korean differ from those used in foreign languages.

For example, (a) of FIG. 4 is a vowel quadrilateral showing the IPA symbols used for English vowels according to their pronunciation positions.

The pronunciation image generating apparatus 10 may use the standard IPA symbols shown in (a) of FIG. 4 as they are, or may use a version of the standard IPA modified according to the pronunciation characteristics of speakers from each country. As one example, (b) of FIG. 4 shows IPA vowel symbols modified to take into account how Koreans pronounce English vowels.

Here, (b) of FIG. 4 may have been modified according to features of Korean speakers' pronunciation organs, such as oral cavity size, vocal cord size, and lip shape, and the position of each modified IPA symbol may change according to the position of the tongue in the oral cavity when the vowel is pronounced. The modified IPA symbols may be as shown in Table 1.

Serial number | Standard IPA | Modified IPA
1 | α | αː'
2 | æ | æ
3 | |
4 | |
5 | |
6 | |

In Table 1, when indicating a phoneme pronounced by the speaker, the modified symbol αː' may be used instead of α. This means that a Korean speaker, because of the characteristics of Korean pronunciation, cannot pronounce an English sound that should be pronounced as α and instead pronounces it as αː'. Therefore, when a text containing α is pronounced, the phoneme list will contain αː' rather than α.

Conversely, for a phoneme that Korean speakers can pronounce correctly according to the standard pronunciation, given the characteristics of their pronunciation organs, the standard symbol can be used as is.

Table 1 shows the case where the pronounced phoneme is a vowel. When a consonant is pronounced, the symbol of the phoneme pronounced by the speaker may additionally be accompanied by information such as its tendency toward voicing or devoicing and the degree of intensity of the plosive, fricative, or affricate relative to the standard IPA symbol.

In addition, since a given phoneme may be realized in several ways depending on the speaker's pronunciation characteristics, several pronunciations may exist for one phoneme, and one phoneme may have a plurality of modified IPA symbols to distinguish them. For example, the sound aI may be realized as n different pronunciations, and n modified IPA vowel symbols, aI', aI²' through aIⁿ', may be used to express them.
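The one-to-many relationship between a standard symbol and its modified variants can be pictured as a lookup table. The Python sketch below is illustrative only: the names `MODIFIED_IPA` and `variants_for`, and the choice of three variants for aI, are assumptions and do not come from the specification.

```python
# Hypothetical lookup table: each standard IPA symbol maps to the
# modified symbols registered for a speaker group.
MODIFIED_IPA = {
    "α": ["αː'"],                   # Table 1, row 1: α is realized as αː'
    "æ": ["æ"],                     # Table 1, row 2: the symbol is unchanged
    "aI": ["aI'", "aI²'", "aI³'"],  # assumed n = 3 variants for illustration
}


def variants_for(standard_symbol):
    """Return the modified symbols registered for a standard symbol.

    When nothing is registered, the standard symbol itself is returned,
    which corresponds to using the standard IPA as it is.
    """
    return MODIFIED_IPA.get(standard_symbol, [standard_symbol])
```

A phoneme list for a text containing α would then be built from `variants_for("α")` rather than from α itself.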

The pronunciation image generating apparatus 10 generates training data by labeling the collected pronunciation images and pronunciation organ cross-sectional images with the modified phoneme information, speaker information, and pronunciation organ characteristics (S130).

The information labeled on a pronunciation image includes not only phoneme information but also speaker information and pronunciation organ characteristics. That is, values such as those in Table 2 may be labeled: the gender and age group of the speaker shown in the pronunciation image, and the characteristics of the pronunciation organs when the speaker pronounces the phoneme.

Category | Content | Type
Gender | Female / male | F/M
Age | Child / teens / 20s to 30s / 40s to 60s / 70s and older | 1 to 5
Pronunciation organ characteristics: oral cavity / vocal cord size | Small / medium / large | 1 to 3
Pronunciation organ characteristics: lip / teeth shape | Lips spread sideways / rounded lips / widely open lips / slightly open lips / closed lips / teeth touching | 1 to 7
Pronunciation organ characteristics: tongue position | Front / central / back / tongue not touching / behind upper teeth / inner palate / between upper and lower teeth / gum behind lower teeth | 1 to 8

In Table 2, gender is labeled F or M, age group is labeled as one of five types from child to 70s and older, and pronunciation organ characteristics are labeled according to each characteristic.

Specifically, the size of the oral cavity and vocal cords may be divided into three categories, and the characteristics of the pronunciation organs according to lip and teeth shape may be divided as shown in FIG. 5(a). For example, the lip and teeth shapes in FIG. 5(a) may be, in order from the front, the shapes used when pronouncing a, e, i, o, u, and f or v.

The position of the tongue when pronouncing a phoneme may be divided as shown in FIG. 5(b). For example, FIG. 5(b) may show, in order from the front, the front, central, and back tongue positions and the tongue positions when pronouncing 'r', 'l', and 'ㄹ'.
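The gender and age labels of Table 2 can be encoded with two small helpers. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the 10-year boundary between "child" and "teens" is assumed because the specification gives only the group names.

```python
def age_type(age_years):
    """Map an age in years to the 1-5 age type of Table 2.

    The child/teens boundary at 10 years is an assumption.
    """
    if age_years < 0:
        raise ValueError("age must be non-negative")
    if age_years < 10:
        return 1  # child
    if age_years < 20:
        return 2  # teens
    if age_years < 40:
        return 3  # 20s to 30s
    if age_years < 70:
        return 4  # 40s to 60s
    return 5      # 70s and older


def gender_type(gender):
    """Map 'female' or 'male' (any case) to the F/M label of Table 2."""
    labels = {"female": "F", "male": "M"}
    try:
        return labels[gender.lower()]
    except KeyError:
        raise ValueError(f"unknown gender label: {gender!r}")
```

Under these assumptions, a 40-year-old female speaker would be labeled `("F", 4)`.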

As described with reference to FIG. 4, the pronunciation image generating apparatus 10 may use the standard IPA stored in the phoneme information DB 600 as it is or in modified form.

With the label information prepared, it must be attached to the video of each phoneme. In general, however, a pronunciation image or pronunciation organ cross-sectional image shows an entire sentence being uttered, so the pronunciation image generating apparatus 10 must first divide the video into phoneme units as shown in FIG. 6.

Referring to FIG. 6, the pronounced sentence is 'I like an apple.' and the phonemes of 'apple' are 'aepl'. The video of the whole sentence being pronounced must be divided into videos in which the individual phonemes 'ae', 'p', and 'l' are pronounced.

To divide sentence-level speech into phonemes, an administrator may segment it manually, or the pronunciation image generating apparatus 10 may use a separate segmentation model. As an example, the model may be trained on acoustic features extracted from per-phoneme videos, learn the acoustic features of each phoneme, and, given an arbitrary speech video, output the phonemes pronounced in the sentence. The acoustic feature used may be Mel Frequency Cepstral Coefficients (MFCC), but is not limited to any one feature.
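Once phoneme boundaries are known, whether from manual segmentation or from a segmentation model, cutting the sentence video into per-phoneme clips is a matter of converting times to frame indices. The sketch below assumes the boundaries are already available in milliseconds. The function name, the 30 fps frame rate, and the example timings are hypothetical, and the MFCC-based segmentation model itself is not shown.

```python
def split_into_phoneme_clips(alignment, fps):
    """Convert (phoneme, start_ms, end_ms) entries into frame ranges.

    Returns a list of (phoneme, first_frame, end_frame) tuples, where
    end_frame is exclusive. Both bounds are rounded down so that
    consecutive clips neither overlap nor leave gaps.
    """
    clips = []
    for phoneme, start_ms, end_ms in alignment:
        if end_ms <= start_ms:
            raise ValueError(f"empty or reversed interval for {phoneme!r}")
        first_frame = start_ms * fps // 1000
        end_frame = end_ms * fps // 1000
        clips.append((phoneme, first_frame, end_frame))
    return clips


# Hypothetical timings for the word 'apple' inside 'I like an apple.'
clips = split_into_phoneme_clips(
    [("ae", 1500, 1700), ("p", 1700, 1780), ("l", 1780, 2000)],
    fps=30,
)
```

Each resulting frame range can then be labeled independently with phoneme, speaker, and pronunciation organ information.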

The pronunciation image generating apparatus 10 then labels each video divided into phoneme units with the phoneme information modified in step S120, the speaker information, and the pronunciation organ characteristics. The label may further include the playback time of the video file and the text of the utterance. The labeled information may be expressed in JSON (JavaScript Object Notation) form.

For example, when the phoneme æ45-1 is pronounced in a particular section of a particular video, phoneme information may be generated with a data structure such as Table 3, including the image information of the section (image), the information on the entire pronounced sentence (text), speaker information (talker), and pronunciation organ information (pronunciation organ).

{
  "æ45-1": {
    "image": {
      "Frame length": "2500",
      "Frame data": ["1500", "1700"],
      "Video resolution": ["1024", "768"]
    },
    "text": {
      "script": "I like an apple",
      "phone": "æ"
    },
    "voice": {
      "length": "2500",
      "ipa phone": "æ45-1",
      "ipa phone difference": ["-0.01", "+0.02"]
    },
    "Talker": {
      "Gender": "female",
      "Age": "40"
    },
    "Pronunciation organ": {
      "timbre": "7",
      "Jaw length": "10",
      "Neck length": "12",
      "Neck thickness": "15",
      "Oral cavity": "25",
      "vocal tract length": "15",
      "vocal tract diameter": "2",
      "Lip size": "5",
      "Lip shape": "type 1",
      "Tongue position (type, type change)": ["type 1", "1, 1"],
      "Tongue movements (height)": "1",
      "sound source strength": "5"
    }
  }
}
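A label of this kind can be assembled and serialized with Python's standard `json` module. The sketch below covers only a subset of the Table 3 fields. The helper name `make_label` is hypothetical, and storing the two-valued "ipa phone difference" field as a JSON array is a representational assumption.

```python
import json


def make_label(phoneme_id, script, phone, gender, age, ipa_difference):
    """Build a label record with a subset of the Table 3 fields."""
    dx, dy = ipa_difference
    return {
        phoneme_id: {
            "text": {"script": script, "phone": phone},
            "voice": {
                "ipa phone": phoneme_id,
                # Signed offsets are stored as strings, as in Table 3.
                "ipa phone difference": [f"{dx:+.2f}", f"{dy:+.2f}"],
            },
            "Talker": {"Gender": gender, "Age": str(age)},
        }
    }


record = make_label("æ45-1", "I like an apple", "æ", "female", 40, (-0.01, 0.02))
# ensure_ascii=False keeps the IPA symbol readable in the output text.
encoded = json.dumps(record, ensure_ascii=False, indent=2)
decoded = json.loads(encoded)
```

Round-tripping through `json.dumps` and `json.loads` is a quick way to confirm that a label is well-formed before it is stored with the training data.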

In Table 3, ipa phone difference indicates, with reference to FIG. 4, how close the pronounced vowel is to the articulation position of the standard IPA symbol.

For example, the phoneme pronounced by the speaker in Table 3 is denoted æ45-1, which is located 0.01 to the left along the x-axis and 0.02 higher along the y-axis in the vowel quadrilateral than the standard IPA symbol æ. The difference between the position of the phoneme pronounced by the speaker and the position of the standard IPA symbol in the vowel quadrilateral of FIG. 4 may be determined by the oral characteristics of Korean speakers.

In addition to the phoneme information, speaker information, and pronunciation organ information described above, the pronunciation image generating apparatus 10 may use image features and speech features extracted from the videos collected in step S110 as training data.

For example, features such as Histogram of Oriented Gradients (HOG) or Local Binary Pattern (LBP) may be extracted from the video and used. The extracted information may be added to the data structure of Table 3 as labels.
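To make the LBP feature concrete, the following sketch computes the basic 8-neighbour LBP code for each interior pixel of a grayscale frame and accumulates a 256-bin histogram. It is a plain-Python illustration of the general technique, not the extraction procedure of the specification. The function names and the list-of-lists image representation are assumptions, and HOG is not shown.

```python
# Neighbour offsets, clockwise starting from the top-left pixel.
_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, row, col):
    """Return the 8-bit LBP code of an interior pixel.

    A bit is set when the neighbour is at least as bright as the centre.
    """
    center = image[row][col]
    code = 0
    for bit, (dr, dc) in enumerate(_OFFSETS):
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Return a 256-bin histogram of LBP codes over interior pixels."""
    histogram = [0] * 256
    for row in range(1, len(image) - 1):
        for col in range(1, len(image[0]) - 1):
            histogram[lbp_code(image, row, col)] += 1
    return histogram
```

The histogram is a fixed-length vector, so it could be appended to a label record regardless of the frame size.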

The pronunciation image generating apparatus 10 uses the training data to train the phoneme analysis model 210, which extracts the phonemes of an input sentence (S140). As an example, the phoneme analysis model 210 may be trained to output a network in which phonemes are sequentially connected, as shown in FIG. 7.

As an example, when the learner pronounces the sentence 'I like an apple.' in FIG. 7, the phoneme analysis model 210 may generate a network to determine probabilistically which phonemes make up the sentence.

For example, when the first phoneme pronounced is aIⁿ', the phoneme that may follow is lⁿ' or lⁿ⁺¹'. The phoneme lⁿ' may be followed by aIⁿ⁺³', and the phoneme lⁿ⁺¹' may be followed by aIⁿ⁺¹'.

The phoneme analysis model 210 may use a previously collected sentence corpus stored in a database (not shown) as training data, and may be implemented as an n-gram language model. Specifically, the SRI Language Modeling Toolkit (SRILM) may be used.
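The idea of an n-gram model over phonemes can be illustrated with a maximum-likelihood bigram model (n = 2) written in plain Python. This sketch does not use SRILM and applies no smoothing. The function name, the sentence boundary markers, and the two-sequence toy corpus are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(sequences):
    """Estimate P(next phoneme | previous phoneme) from phoneme sequences.

    "<s>" and "</s>" mark the start and end of each sequence.
    """
    counts = defaultdict(Counter)
    for sequence in sequences:
        padded = ["<s>"] + list(sequence) + ["</s>"]
        for previous, following in zip(padded, padded[1:]):
            counts[previous][following] += 1

    model = {}
    for previous, followers in counts.items():
        total = sum(followers.values())
        model[previous] = {p: n / total for p, n in followers.items()}
    return model


# Hypothetical toy corpus of phoneme sequences.
corpus = [["aI", "l", "aI", "k"], ["aI", "l", "i"]]
model = train_bigram(corpus)
```

The resulting transition probabilities play the role of the edge weights in a phoneme network such as the one in FIG. 7.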

The pronunciation image generating apparatus 10 uses the training data to train the pronunciation image model 220 for generating 3D images (S150).

The pronunciation image model 220 outputs the pronunciation organs as a 3D video and represents the structure of articulators such as the lips, teeth, uvula, and palate. The size, shape, and position of the oral cavity, pharynx, lips, and tongue may change each time a phoneme is pronounced. The 3D form of the pronunciation organs may be generated with proportions similar to those of actual human pronunciation organs, while the appearance of features unrelated to pronunciation, such as the eyes, nose, and ears, may be set or changed arbitrarily.

The training data includes the information of Table 3 labeled on 2D or 3D pronunciation images and 2D or 3D pronunciation organ cross-sectional images.

The pronunciation image model 220 may be implemented with machine learning and may combine a plurality of machine learning models. For example, Hidden Markov Models (HMMs), Deep Neural Networks (DNNs), Convolutional Neural Networks (CNNs), and the like may be used.

As described with reference to FIG. 1, the phoneme analysis model 210 trained in step S140 and the pronunciation image model 220 trained in step S150 may be a single artificial intelligence model, in which case the models may be trained in a single process. Alternatively, each model may be trained separately and the two models then combined through an arbitrary artificial intelligence model. The form of the model is not limited to any one.

The following describes the process of using the trained phoneme analysis model 210 and pronunciation image model 220 to output a 3D image, shown as a cross-sectional view of the pronunciation organs, for a sentence uttered by the learner, together with a standard pronunciation image of that sentence.

FIG. 8 is a flowchart of a method of operating the pronunciation image generating apparatus according to an embodiment, FIG. 9 is an example of a phoneme list predicted by the phoneme analysis model according to an embodiment, FIG. 10 illustrates how the pronunciation image model generates a simulation image according to an embodiment, and FIGS. 11 and 12 are examples of result screens output by the pronunciation image generating apparatus according to an embodiment.

Referring to FIG. 8, the voice and image input unit 100 of the pronunciation image generating apparatus 10 receives a video of the learner's utterance (S210). When the learner utters an arbitrary sentence, the pronunciation image generating apparatus 10 collects, through a camera and a microphone, the shape of the pronunciation organs and the voice during the utterance. The utterance video may include at least one of a video of the learner taken from the front and a video taken from the side.

The pronunciation image generating apparatus 10 inputs the learner's utterance into the phoneme analysis model 210 to generate a list of the phonemes making up the utterance (S220). The Viterbi algorithm may be used for this.

The Viterbi algorithm finds the most likely sequence of hidden states that produced a sequence of observed events, for example in an HMM. In speech recognition (Speech to Text), if the acoustic signal is regarded as the sequence of observed events, the character string is regarded as the hidden cause of that signal. That is, the Viterbi algorithm is used to find the most likely character string for a given acoustic signal.

Specifically, the pronunciation image generating apparatus 10 extracts the phonemes that may make up the pronounced sentence, calculates for each phoneme the probability that it was pronounced, and calculates for each phoneme the probability that it follows the current phoneme. The pronunciation image generating apparatus 10 may then determine the path through the network with the highest probability as the phoneme list of the sentence.
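The search for the most probable phoneme path can be sketched with a textbook Viterbi implementation over a small HMM. The two candidate phonemes, the observation symbols "open" and "mid", and all probabilities below are made-up values for illustration. The specification does not give the model's actual states or parameters.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its probability) for an HMM."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               previous state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)

    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


# Hypothetical two-phoneme example.
states = ["ae", "e"]
start_p = {"ae": 0.6, "e": 0.4}
trans_p = {"ae": {"ae": 0.7, "e": 0.3}, "e": {"ae": 0.4, "e": 0.6}}
emit_p = {"ae": {"open": 0.8, "mid": 0.2}, "e": {"open": 0.3, "mid": 0.7}}
path, probability = viterbi(["open", "open", "mid"],
                            states, start_p, trans_p, emit_p)
```

In a real system the products would normally be replaced by sums of log probabilities to avoid numerical underflow on long utterances.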

As an example, referring to FIG. 9, consider the case where the learner utters the sentence 'I like an apple.'. The pronunciation image generating apparatus 10 determines, using known speech recognition technology (Speech to Text), that the sentence uttered by the learner is 'I like an apple.'. The phonemes that can make up the sentence 'I like an apple.' may then be arranged as a network.

The trained phoneme analysis model 210 may analyze the sentence pronounced by the learner as the phonemes aIⁿ⁺²', lⁿ⁺³', aIⁿ⁺³', kⁿ⁺²', […], nⁿ⁺³', æⁿ⁺⁴', pⁿ⁺⁴', lⁿ⁺³' pronounced in sequence.

The pronunciation image generating apparatus 10 may additionally extract the duration for which the learner pronounced each phoneme, and may generate the pronunciation image so that it reflects the per-phoneme durations.

The pronunciation image generating apparatus 10 generates a standard pronunciation phoneme list according to the speech recognition result of the learner's utterance (S230). The trained phoneme analysis model 210 may generate the standard pronunciation phoneme list.

The pronunciation image generating apparatus 10 uses the pronunciation image model 220 to generate and output a pronunciation organ image of the learner's utterance according to the phoneme list generated in step S220 (S240).

The pronunciation image generating apparatus 10 uses the pronunciation image model 220 to generate and output a standard pronunciation image for the standard pronunciation phoneme list generated in step S230 (S250).

Referring to FIG. 10, the pronunciation image model 220 may determine, based on the standard pronunciation image of each phoneme, the form of the pronunciation organs, such as mouth shape and tongue position, needed to utter the standard pronunciation. The determined pronunciation organ characteristics may be combined with the cross-sectional image of the pronunciation organs to generate a 3D pronunciation image as the standard pronunciation image.

As an example, the pronunciation image generating apparatus 10 may output a screen such as FIG. 11 or FIG. 12, and the learner may press the buttons provided on the screen to examine in detail the difference between the standard pronunciation and the learner's own pronunciation.

The pronunciation image generating apparatus 10 provides a pronunciation guide by analyzing the differences between the learner image and the standard pronunciation image (S260). Based on the pronunciation guide, the learner can watch the standard pronunciation image and their own pronunciation image and correct their pronunciation themselves.

The pronunciation guide may be obtained from a previously stored DB (not shown), or the pronunciation image generating apparatus 10 may produce it by directly calculating the difference between the standard pronunciation and the learner's pronunciation. A guide stored in the DB in advance contains instructions the learner should follow to correct the phoneme difference, and may be as shown in Table 4.

Standard phoneme | Learner phoneme | Pronunciation guide
aI | aIⁿ⁺²' | The learner moves the tongue from low central to high front. The mouth should be open and the tongue should move from low back to near-high front.
l | lⁿ⁺³' | The mouth should be slightly open and the tongue should touch just behind the upper teeth.
aI | aIⁿ⁺³' | The mouth shape should be widened.
k | kⁿ⁺²' | The '으' (eu) sound should be dropped, the tongue placed high front, and the sound pronounced briefly as a burst.
 | | The learner pronounces with the tongue in a mid-low position. The tongue should be in the central position.
n | nⁿ⁺³' | The learner pronounces with the front part of the tongue on the upper gum. The front part of the tongue should be pressed firmly against the upper gum while air is released through the nose.
æ | æⁿ⁺⁴' | The learner raises the back part of the tongue and opens the mouth wide. The sound should be pronounced with the middle part of the tongue raised.
p | pⁿ⁺⁴' | The learner pronounces with the mouth closed and then slightly opened. The mouth should be closed and then opened slightly to the sides.
l | lⁿ⁺³' | The learner touches the tongue to the roof of the mouth. The tongue should touch behind the teeth.
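A guide stored in advance can be looked up by the pair of standard and learner phonemes. The sketch below assumes that the two phoneme lists are already aligned position by position. The names `GUIDE_DB` and `build_guide` are hypothetical, and only two entries, paraphrased from Table 4, are included.

```python
# Hypothetical guide table keyed by (standard phoneme, learner phoneme).
GUIDE_DB = {
    ("aI", "aIⁿ⁺³'"): "Widen the mouth shape.",
    ("k", "kⁿ⁺²'"): "Drop the extra vowel and pronounce a short burst.",
}


def build_guide(standard_phonemes, learner_phonemes):
    """Collect stored guides for positions where the phonemes differ.

    Pairs with no stored guide are skipped. A fuller system might
    compute a guide for them instead.
    """
    guide = []
    for standard, learner in zip(standard_phonemes, learner_phonemes):
        if standard == learner:
            continue  # pronounced as in the standard list
        text = GUIDE_DB.get((standard, learner))
        if text is not None:
            guide.append((standard, learner, text))
    return guide


guide = build_guide(["aI", "k", "p"], ["aIⁿ⁺³'", "kⁿ⁺²'", "p"])
```

Keying on the pair rather than on the learner phoneme alone allows the same mispronunciation to carry different advice depending on the intended target.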

As an example of calculating the difference in pronunciation, the similarity between the standard pronunciation phoneme and the phoneme pronounced by the learner may be calculated. The similarity may be calculated by expressing the positions of the phonemes in the vowel quadrilateral of FIG. 4 as coordinates.
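One way to turn vowel quadrilateral coordinates into a similarity score is to take the Euclidean distance between the two positions and map it into the range (0, 1]. The specification does not fix a formula. The mapping 1 / (1 + d), the function names, and the example coordinates below are all assumptions.

```python
import math


def vowel_distance(standard_xy, learner_xy):
    """Euclidean distance between two points in the vowel quadrilateral."""
    return math.hypot(learner_xy[0] - standard_xy[0],
                      learner_xy[1] - standard_xy[1])


def vowel_similarity(standard_xy, learner_xy):
    """Map distance to a score in (0, 1], where 1.0 means identical."""
    return 1.0 / (1.0 + vowel_distance(standard_xy, learner_xy))


# Hypothetical coordinates: the learner's vowel lies 0.01 to the left of
# and 0.02 above the standard position.
standard = (0.30, 0.20)
learner = (0.29, 0.22)
distance = vowel_distance(standard, learner)
similarity = vowel_similarity(standard, learner)
```

A threshold on the similarity could then decide whether a guide entry is generated for that phoneme.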

If the durations of the pronunciations differ, information about the duration may be added to the pronunciation guide.
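A duration remark could be produced by comparing the two durations against a tolerance. The sketch below is hypothetical throughout: the function name, the millisecond unit, the 50 ms default tolerance, and the wording of the remarks are assumptions.

```python
def duration_remark(standard_ms, learner_ms, tolerance_ms=50):
    """Return a remark when durations differ by more than the tolerance.

    Returns None when the learner's duration is close enough to the
    standard duration that no remark is needed.
    """
    difference = learner_ms - standard_ms
    if abs(difference) <= tolerance_ms:
        return None
    if difference > 0:
        return f"Shorten the sound by about {difference} ms."
    return f"Lengthen the sound by about {-difference} ms."
```

The returned text, when present, would simply be appended to the guide entry for that phoneme.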

FIG. 13 is a hardware configuration diagram of a computing device according to an embodiment.

Referring to FIG. 13, the voice and image input unit 100, the learning unit 200, the pronunciation image generation unit 300, the pronunciation guide providing unit 400, the pronunciation image DB 500, and the phoneme information DB 600 may be implemented as a program, containing instructions written to perform the operations of the present invention, that is executed on a computing device 700 operated by at least one processor.

The hardware of the computing device 700 may include at least one processor 710, a memory 720, a storage 730, and a communication interface 740, which may be connected through a bus. Hardware such as an input device and an output device may also be included. The computing device 700 may be loaded with various software, including an operating system capable of running the program.

The processor 710 controls the operation of the computing device 700 and may be any of various types of processor that process the instructions included in a program, for example a CPU (Central Processing Unit), an MPU (Micro Processor Unit), an MCU (Micro Controller Unit), or a GPU (Graphic Processing Unit). The memory 720 loads the program so that the instructions written to perform the operations of the present invention are processed by the processor 710. The memory 720 may be, for example, ROM (read only memory) or RAM (random access memory). The storage 730 stores the various data and programs required to perform the operations of the present invention. The communication interface 740 may be a wired or wireless communication module.

The embodiments of the present invention described above are not implemented only through an apparatus and a method, and may also be implemented through a program that realizes functions corresponding to the configuration of the embodiments or through a recording medium on which the program is recorded.

Although embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto. Various modifications and improvements made by those skilled in the art using the basic concept of the present invention as defined in the following claims also fall within the scope of the present invention.

Claims

A method of operating a computing device operated by at least one processor, the method comprising:
receiving a learner's pronunciation voice and video for an arbitrary text;
generating a standard pronunciation phoneme list of the text;
generating a learner pronunciation phoneme list by extracting, from the learner's pronunciation voice and video, the phonemes with which the text was pronounced; and
generating a standard pronunciation simulation image pronouncing the standard pronunciation phoneme list by inputting the standard pronunciation phoneme list into a trained image generation model, generating a learner pronunciation simulation image pronouncing the learner pronunciation phoneme list by inputting the learner's pronunciation voice and video and the learner pronunciation phoneme list into the image generation model, and outputting the standard pronunciation simulation image and the learner pronunciation simulation image,
wherein the standard pronunciation simulation image and the learner pronunciation simulation image are virtual pronunciation organ images simulating a pronunciation process.

The method of claim 1, further comprising:
analyzing differences in the shape or movement of each pronunciation organ between the standard pronunciation simulation image and the learner pronunciation simulation image, and generating a pronunciation guide including the differences.

The method of claim 1, further comprising:
collecting a first image including the shape of the oral cavity for each phoneme and a second image including the cross-sectional shape of the pronunciation organs for each phoneme;
generating training data by tagging the first image or the second image with the corresponding phoneme, speaker information including the speaker's gender and age group, and the shape of the oral cavity and the shape of the pronunciation organs when the phoneme is pronounced; and
training the image generation model, using the training data, on the relationship between the shape of the oral cavity and the cross-sectional shape of the pronunciation organs for each phoneme.

The method of claim 3,
wherein the shape of the oral cavity includes at least one of the size of the oral cavity, the shape of the lips, whether the lower teeth and the upper teeth are in contact with each other, or the position of the tongue.

The method of claim 3,
wherein the phoneme is represented by a phonetic symbol, and the phonetic symbol is an International Phonetic Alphabet (IPA) symbol modified according to the speaker information.

The method of claim 5,
wherein the training data further includes a numerical value representing the difference between the phonetic symbol modified according to the speaker information and the International Phonetic Alphabet symbol.

A computing device comprising:
a memory; and
at least one processor executing instructions of a program loaded into the memory,
wherein the program comprises instructions written to execute:
generating training data by collecting training images including the shape of the oral cavity and the cross-sectional shape of the pronunciation organs for each phoneme, and tagging them with the phoneme pronounced in the training image, the standard phoneme of the pronounced phoneme, and speaker information including the speaker's gender and age;
training, with the training data, an image generation model that outputs a virtual pronunciation organ simulation image in which the standard phoneme of an input phoneme is pronounced;
inputting a learner's pronunciation voice and video for an arbitrary text into the image generation model; and
displaying on a screen a standard pronunciation simulation image of the text output from the image generation model.

The computing device of claim 7,
wherein the training data further includes histogram information or feature point information extracted from the training images.

The computing device of claim 7,
wherein the training step comprises calculating probabilities of relationships between phonemes that can be pronounced consecutively, and
the displaying step comprises generating a standard pronunciation phoneme list of the text and a pronounced phoneme list of the text based on the probabilities.

The computing device of claim 9,
wherein the program further comprises instructions written to execute generating a learner pronunciation simulation image in which the virtual pronunciation organ pronounces the text.

The computing device of claim 9,
wherein the program further comprises instructions written to execute analyzing and providing a difference between the standard pronunciation phoneme list and the pronounced phoneme list of the text.