KR20210152687A - Method of evaluating english pronunciation using encoder-decoder attention difference of speech transformer - Google Patents


Publication number
KR20210152687A
Authority
KR
South Korea
Prior art keywords
english pronunciation
encoder
pronunciation
native speaker
evaluating
Prior art date
Application number
KR1020200069483A
Other languages
Korean (ko)
Inventor
김광호
김덕규
Original Assignee
김광호
김덕규
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 김광호, 김덕규
Priority to KR1020200069483A
Publication of KR20210152687A

Classifications

    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B19/00 Teaching not covered by other main groups of this subclass
    • G09B19/06 Foreign languages
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

An English pronunciation evaluation method according to one embodiment of the present invention may comprise: a step of learning the English pronunciation of a native speaker; and a step of evaluating the English pronunciation of a non-native speaker based on the learned English pronunciation of the native speaker. According to the present invention, the English pronunciation of a non-native speaker can be evaluated using deep learning-based speech recognition technology.

Description

Method for evaluating English pronunciation using the encoder-decoder attention difference of a speech transformer {METHOD OF EVALUATING ENGLISH PRONUNCIATION USING ENCODER-DECODER ATTENTION DIFFERENCE OF SPEECH TRANSFORMER}

The present application relates to a method for evaluating English pronunciation using the difference in the encoder-decoder attention of a speech transformer.

With the internationalization of industry, interest in second foreign languages is growing, and to respond to this trend it is necessary to research and develop foreign-language pronunciation evaluation methods for language-learning programs.

According to an embodiment of the present invention, there is provided a method for evaluating English pronunciation using the encoder-decoder attention difference of a speech transformer, capable of evaluating the English pronunciation of a non-native speaker using deep learning-based speech recognition technology.

According to an embodiment of the present invention, there is provided an English pronunciation evaluation method comprising: a step of learning the English pronunciation sequence of a native speaker; and a step of evaluating the English pronunciation of a non-native speaker based on the learned English pronunciation sequence of the native speaker.

According to an embodiment of the present invention, the English pronunciation of a non-native speaker can be evaluated using deep learning-based speech recognition technology.

FIG. 1 is a diagram illustrating the architecture of a transformer according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an English pronunciation evaluation process using a native speaker's attention matrix and a non-native speaker's attention matrix according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. However, the embodiments of the present invention may be modified into various other forms, and the scope of the present invention is not limited to the embodiments described below. The shapes and sizes of elements in the drawings may be exaggerated for clarity, and elements denoted by the same reference numerals in the drawings are the same elements.

FIG. 1 is a diagram illustrating the architecture of a transformer according to an embodiment of the present invention. Meanwhile, FIG. 2 is a diagram illustrating an English pronunciation evaluation process using a native speaker's attention matrix and a non-native speaker's attention matrix according to an embodiment of the present invention.

Measurement of the encoder-decoder attention difference of a transformer, as proposed in the present application, consists of the following two steps.

(1) Native speaker's English pronunciation-sequence learning step

(2) Non-native speaker's (second language learner's) English pronunciation evaluation step

That is, the method consists of an English pronunciation-sequence learning step and a pronunciation-sequence evaluation (prediction) step in the field of deep learning-based speech recognition.

(1) Native speaker's English pronunciation-sequence learning step

- Modeling technique: uses the architecture of the Speech Transformer

- Input: English sentences spoken by a native speaker

- Output: an English pronunciation-label sequence (about 84 English pronunciation labels are defined)

- The pronunciation labels were defined using the CMU Pronouncing Dictionary (source: http://www.speech.cs.cmu.edu/cgi-bin/cmudict).
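
A pronunciation-label inventory of the kind described above can be derived from CMU Pronouncing Dictionary entries, whose format is `WORD PHONE PHONE ...` with ARPAbet phones and stress digits (0/1/2) on vowels; stress-marked variants are one way the label set grows toward the roughly 84 labels mentioned. The sample entries and helper below are an illustrative sketch, not taken from the patent.

```python
# Sketch: deriving a pronunciation-label inventory from CMUdict-style entries.
# The three sample entries below are illustrative; the full dictionary is at
# http://www.speech.cs.cmu.edu/cgi-bin/cmudict.

SAMPLE_ENTRIES = [
    "HELLO  HH AH0 L OW1",
    "WORLD  W ER1 L D",
    "ATTENTION  AH0 T EH1 N SH AH0 N",
]

def phone_inventory(entries):
    """Collect the sorted set of distinct phone labels used by the entries."""
    phones = set()
    for line in entries:
        word, *labels = line.split()  # first token is the word, rest are phones
        phones.update(labels)
    return sorted(phones)

inventory = phone_inventory(SAMPLE_ENTRIES)
print(inventory)
```

Run over the full dictionary instead of the toy entries, this yields the complete stress-marked ARPAbet label set.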

Specifically, the transformer has shown strong performance as a deep learning model in both natural language processing and speech recognition.

In the present application, pronunciation is modeled by using the structure of a speech-recognition transformer while changing the recognition unit from words to English pronunciation-sequence labels.

This is the step that models the English pronunciation of native speakers.

The pronunciation evaluation result is presented by computing the difference between the encoder-decoder attention of a native speaker and that of a non-native speaker (second language learner).

In FIG. 1, the output of the encoder-decoder attention block in the right middle part is a vector at each time step, so over time it forms a matrix. In the present patent, the final pronunciation evaluation result is presented by computing the difference between these encoder-decoder attention matrices.
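
To make the matrix structure concrete: encoder-decoder attention produces, for each decoder step, a distribution over encoder frames, and stacking those rows gives a (decoder steps x encoder frames) matrix. The following is a minimal pure-Python sketch of scaled dot-product attention weights; the toy query and key vectors are illustrative assumptions, not the patent's trained model.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_matrix(queries, keys):
    """Scaled dot-product attention weights.

    queries: decoder-side vectors (one per output label)
    keys:    encoder-side vectors (one per input speech frame)
    Returns a (len(queries) x len(keys)) matrix; each row sums to 1.
    """
    d = len(keys[0])
    rows = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        rows.append(softmax(scores))
    return rows

# Illustrative toy vectors: 2 decoder steps attending over 3 encoder frames.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
M = attention_matrix(Q, K)
print(len(M), len(M[0]))  # a 2 x 3 attention matrix
```

In a real speech transformer these weights would come from a trained multi-head attention layer; the point here is only the matrix shape that the method compares between speakers.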

(2) Non-native speaker's (second language learner's) English pronunciation evaluation step (assuming that the native speaker and the non-native speaker utter the same sentence)

- Generate the encoder-decoder attention matrix for the native speaker's utterance (NS-M).

- Assuming that the non-native speaker's utterance is the same sentence as the native speaker's, generate the encoder-decoder attention matrix with the teacher forcing technique applied (SL-M).

- Compute the difference between NS-M and SL-M,

- apply the sigmoid function to normalize the values to between 0 and 1, and

- take the average value and present it as the English pronunciation evaluation result.
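
The evaluation steps above (difference, sigmoid normalization, average) can be sketched as follows. The toy matrices are illustrative, and details the text does not fix, such as whether the difference is signed or absolute, are assumptions here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pronunciation_score(ns_m, sl_m):
    """Score a learner utterance from two aligned attention matrices.

    ns_m, sl_m: equal-shaped attention matrices for the native speaker
    (NS-M) and the learner (SL-M, obtained via teacher forcing).
    Steps: (1) element-wise difference, (2) sigmoid to map each value
    into (0, 1), (3) average over all elements.
    """
    total, count = 0.0, 0
    for ns_row, sl_row in zip(ns_m, sl_m):
        for a, b in zip(ns_row, sl_row):
            total += sigmoid(a - b)  # signed difference is an assumption
            count += 1
    return total / count

# With identical matrices every sigmoid(0) is 0.5, so the score is 0.5.
ns = [[0.7, 0.3], [0.2, 0.8]]
print(pronunciation_score(ns, ns))
```

How the final scalar maps onto a pronunciation grade (e.g., whether 0.5 means a perfect match under the signed-difference reading) is not specified in the text and would be a calibration choice.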

As described above, according to an embodiment of the present invention, the English pronunciation of a non-native speaker can be evaluated using deep learning-based speech recognition technology.

The English pronunciation evaluation method using the encoder-decoder attention difference of a speech transformer according to the embodiment of the present invention described above may be implemented as a program to be executed on a computer and stored in a computer-readable recording medium. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, and also include media implemented in the form of carrier waves (for example, transmission over the Internet). Furthermore, the computer-readable recording medium may be distributed over network-connected computer systems so that computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the method can be easily inferred by programmers in the art to which the present invention pertains.

The present invention is not limited by the above-described embodiments and the accompanying drawings. The scope of rights is intended to be defined by the appended claims, and it will be apparent to those of ordinary skill in the art that various substitutions, modifications, and changes can be made without departing from the technical spirit of the present invention described in the claims.

100: architecture of the transformer

Claims (1)

A method for evaluating English pronunciation, comprising:
learning the English pronunciation sequence of a native speaker; and
evaluating the English pronunciation of a non-native speaker based on the learned English pronunciation sequence of the native speaker.
KR1020200069483A 2020-06-09 2020-06-09 Method of evaluating english pronunciation using encoder-decoder attention difference of speech transformer KR20210152687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020200069483A KR20210152687A (en) 2020-06-09 2020-06-09 Method of evaluating english pronunciation using encoder-decoder attention difference of speech transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020200069483A KR20210152687A (en) 2020-06-09 2020-06-09 Method of evaluating english pronunciation using encoder-decoder attention difference of speech transformer

Publications (1)

Publication Number Publication Date
KR20210152687A 2021-12-16

Family

ID=79033178

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020200069483A KR20210152687A (en) 2020-06-09 2020-06-09 Method of evaluating english pronunciation using encoder-decoder attention difference of speech transformer

Country Status (1)

Country Link
KR (1) KR20210152687A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20230114893A (en) 2022-01-26 2023-08-02 서강대학교산학협력단 Self-supervised Swin transformer model structure and method of learning the self-supervised Swin transformer model
KR20230149554A (en) 2022-04-20 2023-10-27 중앙대학교 산학협력단 Apparatus and method for normalization of vision transformers

Similar Documents

Publication Publication Date Title
Agarwal et al. A review of tools and techniques for computer aided pronunciation training (CAPT) in English
US10573296B1 (en) Reconciliation between simulator and speech recognition output using sequence-to-sequence mapping
CN107103900B (en) Cross-language emotion voice synthesis method and system
Feraru et al. Cross-language acoustic emotion recognition: An overview and some tendencies
Chao et al. 3m: An effective multi-view, multi-granularity, and multi-aspect modeling approach to english pronunciation assessment
KR20210152687A (en) Method of evaluating english pronunciation using encoder-decoder attention difference of speech transformer
US8157566B2 (en) Adjustable hierarchical scoring method and system
Khomitsevich et al. A bilingual Kazakh-Russian system for automatic speech recognition and synthesis
Nagano et al. Data augmentation based on vowel stretch for improving children's speech recognition
Maraoui et al. Arabic discourse analysis based on acoustic, prosodic and phonetic modeling: elocution evaluation, speech classification and pathological speech correction
KR20210059995A (en) Method for Evaluating Foreign Language Speaking Based on Deep Learning and System Therefor
Zhang et al. Learning Syllable-Level Discrete Prosodic Representation for Expressive Speech Generation.
Peabody et al. Towards automatic tone correction in non-native mandarin
Ekpenyong et al. Improved syllable-based text to speech synthesis for tone language systems
Hanzlíček et al. LSTM-based speech segmentation trained on different foreign languages
Shufang Design of an automatic english pronunciation error correction system based on radio magnetic pronunciation recording devices
Azim et al. Large vocabulary Arabic continuous speech recognition using tied states acoustic models
CN111508522A (en) Statement analysis processing method and system
Newman et al. Speaker independent visual-only language identification
Lee et al. Foreign language tutoring in oral conversations using spoken dialog systems
Baranwal et al. Improved Mispronunciation detection system using a hybrid CTC-ATT based approach for L2 English speakers
US20210304628A1 (en) Systems and Methods for Automatic Video to Curriculum Generation
KR102395702B1 (en) Method for providing english education service using step-by-step expanding sentence structure unit
US10783873B1 (en) Native language identification with time delay deep neural networks trained separately on native and non-native english corpora
Minematsu Pronunciation assessment based upon the compatibility between a learner's pronunciation structure and the target language's lexical structure.