KR20210122070A

KR20210122070A - Voice synthesis apparatus and method for 'Call me' service using language feature vector

Info

Publication number: KR20210122070A
Application number: KR1020210024742A
Authority: KR
Inventors: 정승환
Original assignee: (주)라이언로켓
Priority date: 2020-02-03
Filing date: 2021-02-24
Publication date: 2021-10-08
Also published as: KR102352986B1; KR20210122071A; KR102222597B1; KR102401243B1; KR20210122069A; KR102352987B1

Abstract

The present invention relates to a voice synthesis apparatus and method for a "call me" service. To this end, the apparatus includes a name pronunciation information generating module that receives name text information from a name text information receiving module and generates name pronunciation information by converting the received name text information into a pronunciation string. According to an embodiment of the present invention, a service can be provided in which the user is called by name in the speaker's own voice, for the voice and video content of various speakers.

Description

Voice synthesis apparatus and method for 'Call me' service using language feature vector

The present invention relates to a voice synthesis apparatus and method for a call-me service using a language feature vector.

Speech synthesis is the artificial production of human speech. Text-to-speech (TTS), which converts text into speech, is the representative speech synthesis method.

Conventional TTS algorithms mainly used the unit selection approach, which extracts linguistic features from the input text and produces output by pasting matching waveform segments together. The crux of this approach was how well audio segments could be matched against an existing waveform dictionary. The statistical parametric approach appeared later. It added a module that probabilistically converts the extracted linguistic features into acoustic features, but it was difficult to commercialize because of sound-quality and noise problems.

With the advent of deep learning, an end-to-end approach developed that differs from conventional TTS; representative algorithms include WaveNet, Tacotron, and Tacotron 2. These deep learning algorithms comprise an encoding stage that generates embedding vectors (linguistic features) for syllables or phonemes, a decoding stage that probabilistically converts the linguistic features into acoustic features, and a vocoder stage that converts the acoustic features into a waveform. Deep learning models such as RNNs/LSTMs may be used at each stage.

Republic of Korea Registered Patent No. 10-1716013, Speech synthesis processing of content for multiple languages, Apple Inc.

However, these TTS services go no further than outputting, from text, a voice with specific voice characteristics. No service has yet emerged that touches users emotionally and gives them a sense of human connection.

Accordingly, an object of the present invention is to provide a voice synthesis apparatus and method for a call-me service with which a user can feel a deep connection, by synthesizing speech of the user's name into the voice or video of various speakers.

Specific means for achieving the object of the present invention are described below.

The object of the present invention may be achieved by providing a voice synthesis apparatus for a call-me service, comprising: a name text information receiving module that receives name text information about a user or a third party from a user client 100 or a specific service server 200; a name pronunciation information generating module that receives the name text information from the name text information receiving module and generates name pronunciation information by converting the received name text information into a pronunciation string; a language feature vector generating module that receives the name pronunciation information generated by the name pronunciation information generating module and, based on it, generates a language feature vector for the text of the name pronunciation information; an acoustic feature vector generating module that receives the language feature vector from the language feature vector generating module and, based on it, generates an acoustic feature vector containing n pieces of spectrogram information; a name voice information generating module that receives the acoustic feature vector output by the acoustic feature vector generating module and generates name voice information by feeding the acoustic feature vector into a Griffin-Lim module; and a name voice information merging module that inserts the name voice information generated by the name voice information generating module into the name region of voice information or video information of a specific speaker in which a name region is preset, thereby merging the name voice information with the voice information or video information of the specific speaker.

Another object of the present invention may be achieved by providing a voice synthesis apparatus for a call-me service, comprising: a name text information receiving module that receives name text information about a user or a third party from a user client or a specific service server; a name pronunciation information generating module that receives the name text information from the name text information receiving module and generates name pronunciation information by converting the received name text information into a pronunciation string; a language feature vector generating module that receives the name pronunciation information generated by the name pronunciation information generating module and, based on it, generates a language feature vector for the text of the name pronunciation information; an acoustic feature vector generating module that receives the language feature vector from the language feature vector generating module and, based on it, generates an acoustic feature vector relating to spectrogram information; a name voice information generating module that receives the acoustic feature vector output by the acoustic feature vector generating module and generates name voice information using the acoustic feature vector; and a name voice information merging module that inserts the name voice information generated by the name voice information generating module into the name region of merging target content information, which is voice information or video information of a specific speaker in which a name region is preset, thereby merging the name voice information with the merging target content information of the specific speaker and generating merged content information; wherein the apparatus constitutes a call-me service that generates the merged content information so that the merging target content information is output while the name text information is called out with the voice characteristics of the specific speaker.

In addition, the acoustic feature vector generating module may be an artificial neural network that takes as input a context vector generated from the language feature vector together with the acoustic feature vector of a specific frame, and outputs the acoustic feature vector of the frame that follows the specific frame.
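The frame-by-frame generation described above can be sketched as a simple loop. This is only an illustration under stated assumptions: `generate_frames`, `toy_step`, and the two-bin frames are hypothetical names and values chosen for the example, and `toy_step` is an arithmetic stand-in for the trained neural network, not the network of the invention.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # one acoustic feature vector (e.g. one spectrogram frame)


def generate_frames(
    context: Sequence[float],
    step: Callable[[Sequence[float], Frame], Frame],
    first_frame: Frame,
    n_frames: int,
) -> List[Frame]:
    """Produce n_frames frames, feeding each output back in as the next input."""
    frames: List[Frame] = []
    prev = first_frame
    for _ in range(n_frames):
        nxt = step(context, prev)  # (context vector, frame t) -> frame t+1
        frames.append(nxt)
        prev = nxt
    return frames


def toy_step(context: Sequence[float], prev: Frame) -> Frame:
    # Hypothetical stand-in for the trained network: each bin of the previous
    # frame moves halfway toward the matching context value.
    return [0.5 * p + 0.5 * c for p, c in zip(prev, context)]


frames = generate_frames([1.0, 2.0], toy_step, [0.0, 0.0], n_frames=3)
```

In a real system the `step` callable would be the trained decoder, and the loop would usually stop on a learned end-of-sequence signal rather than a fixed frame count.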

In addition, the name pronunciation information generating module may convert the name text information into a pronunciation string and then unify similar pronunciation sequences to generate the name pronunciation information.

In addition, when the name text information is written in a foreign language, that is, a language other than Korean, the name pronunciation information generating module may first convert the foreign-language name text information into Hangul name text information and then convert that name text information into a pronunciation string to generate the name pronunciation information.

Another object of the present invention may be achieved by providing a voice synthesis method for a call-me service, comprising: a name text information receiving step in which a name text information receiving module receives name text information about a user or a third party from a user client or a specific service server; a name pronunciation information generating step in which a name pronunciation information generating module receives the name text information from the name text information receiving module and generates name pronunciation information by converting the received name text information into a pronunciation string; a language feature vector generating step in which a language feature vector generating module receives the name pronunciation information generated by the name pronunciation information generating module and, based on it, generates a language feature vector for the text of the name pronunciation information; an acoustic feature vector generating step in which an acoustic feature vector generating module receives the language feature vector from the language feature vector generating module and, based on it, generates an acoustic feature vector relating to spectrogram information; a name voice information generating step in which a name voice information generating module receives the acoustic feature vector output by the acoustic feature vector generating module and generates name voice information using the acoustic feature vector; and a name voice information merging step in which a name voice information merging module inserts the name voice information generated by the name voice information generating module into the name region of merging target content information, which is voice information or video information of a specific speaker in which a name region is preset, thereby merging the name voice information with the merging target content information of the specific speaker and generating merged content information; wherein the method constitutes a call-me service that generates the merged content information so that the merging target content information is output while the name text information is called out with the voice characteristics of the specific speaker.

In addition, in the name pronunciation information generating step, the name pronunciation information generating module may convert the name text information into a pronunciation string and then unify similar pronunciation sequences to generate the name pronunciation information.

In addition, when the name text information is written in a foreign language, that is, a language other than Korean, in the name pronunciation information generating step the name pronunciation information generating module may first convert the foreign-language name text information into Hangul name text information and then convert that name text information into a pronunciation string to generate the name pronunciation information.

As described above, the present invention has the following effects.

First, according to an embodiment of the present invention, a service becomes possible in which, for the voice or video of various speakers, the user is called by name in the voice of the speaker concerned.

Second, according to an embodiment of the present invention, pronunciation rules do not have to be preset separately for each of various languages.

Third, according to an embodiment of the present invention, pronunciation unification and the conversion of foreign languages into Hangul improve the accuracy of the speech synthesis engine and reduce the amount of training data required.

The following drawings attached to this specification illustrate preferred embodiments of the present invention and, together with the detailed description, serve to aid understanding of its technical idea. The present invention should therefore not be construed as limited to the matters shown in the drawings.
FIG. 1 is a schematic diagram of a voice synthesis apparatus for a call-me service according to an embodiment of the present invention;
FIGS. 2 and 3 are schematic diagrams of the flow by which a name pronunciation information generating module generates name pronunciation information according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a pronunciation unification artificial neural network of a name pronunciation information generating module according to another embodiment of the present invention;
FIG. 5 is a schematic diagram of language feature vector generation by a language feature vector generating module;
FIG. 6 is a schematic diagram of acoustic feature vector generation by an acoustic feature vector generating module according to an embodiment of the present invention; and
FIG. 7 is a schematic diagram of the merging of name voice information by a name voice information merging module according to an embodiment of the present invention.

Embodiments are described in detail below with reference to the accompanying drawings so that a person of ordinary skill in the art to which the present invention pertains can readily practice the invention. However, in describing the operating principle of the preferred embodiments in detail, a detailed description of a related known function or configuration is omitted where it is judged that it could unnecessarily obscure the gist of the present invention.

In addition, the same reference numerals are used throughout the drawings for parts with similar functions and operations. Throughout the specification, when one part is said to be connected to another, this covers not only direct connection but also indirect connection through another element interposed between them. Likewise, stating that something includes a particular component means that it may further include other components, not that other components are excluded, unless specifically stated otherwise.

In the following description, Hangul is a phonemic script composed of consonants and vowels. A syllable consists of three phoneme positions: the initial sound (choseong), the medial sound (jungseong), and the final sound (jongseong). Consonants are used for the initial and final sounds, and a vowel is used for the medial sound.

In modern Hangul there are 14 basic consonant letters, such as ㄱ, ㄴ, ㄷ, ㄹ, and ㅁ, and 10 basic vowel letters, such as ㅏ, ㅑ, ㅓ, and ㅕ. There are 5 double consonants (ㄲ, ㄸ, ㅃ, ㅆ, ㅉ) and 11 compound vowels, such as ㅐ, ㅒ, ㅔ, and ㅖ. The final consonant (batchim) written when a syllable has a final sound is either a single batchim or a double batchim: every consonant can serve as a single batchim, and there are 13 double batchim, such as ㄲ, ㅆ, ㄳ, and ㄵ.

By combining these letters, modern Hangul can write 11,172 syllable blocks (19 initial sounds × 21 medial sounds × (27 final sounds + 1 for no final sound)). Of these 11,172 blocks, 399 have no batchim and 10,773 have a batchim.
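These counts can be reproduced directly. The sketch below additionally assumes the Unicode layout of precomposed Hangul syllables, which the patent text does not mention: the block starts at U+AC00 and is ordered by initial, then medial, then final index. The helper name `compose` is hypothetical.

```python
HANGUL_BASE = 0xAC00  # first precomposed syllable in Unicode ('가')
N_INITIAL, N_MEDIAL, N_FINAL = 19, 21, 28  # 28 = 27 finals + "no final"


def compose(initial: int, medial: int, final: int) -> str:
    """Build one syllable block from zero-based initial/medial/final indices."""
    return chr(HANGUL_BASE + (initial * N_MEDIAL + medial) * N_FINAL + final)


total = N_INITIAL * N_MEDIAL * N_FINAL   # every writable syllable block
without_final = N_INITIAL * N_MEDIAL     # blocks with final index 0 (no batchim)
with_final = total - without_final       # blocks that carry a batchim

print(total, without_final, with_final)  # prints: 11172 399 10773
```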

Under the language regulations, the number of syllables actually pronounced in modern standard Korean is smaller than this. A Korean syllable sound consists of an initial sound plus a medial sound (plus an optional final sound). In the standard language all 19 consonants are used as initial sounds, although ㅇ in the initial position is silent. Final sounds fall into 7 classes under the seven-final-consonant rule, or 8 classes when the absence of a final sound is included, so modern Korean pronunciation yields 19 initial sounds × 21 medial sounds × 8 final sounds = 3,192 sounds. Under the Standard Pronunciation rules, the diphthongs ㅑ, ㅒ, ㅕ, ㅖ, ㅛ, and ㅠ following the palatal consonants ㅈ, ㅉ, and ㅊ are pronounced as the monophthongs ㅏ, ㅐ, ㅓ, ㅔ, ㅗ, and ㅜ, which removes 3 initial sounds × 6 medial sounds × 8 final sounds = 144 sounds. In addition, the medial sound [ㅢ] carried by a sounded initial (that is, following an initial other than ㅇ) is pronounced [ㅣ], except after ㄴ (in the case of ㄴ it is recognized as a different phoneme because of palatalization) (Hangul Orthography Article 9 and Standard Pronunciation Article 5, proviso 3), which removes a further 17 initial sounds × 1 medial sound × 8 final sounds = 136 sounds. The number of syllables actually used in current standard Korean is therefore 3,192 - 144 - 136 = 2,912.
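The arithmetic above can be checked by enumerating every initial/medial pair and applying the two exclusions. This is a minimal sketch that encodes only the rules stated in this paragraph; the names `INITIALS`, `MEDIALS`, `FINAL_CLASSES`, and `is_distinct` are hypothetical.

```python
INITIALS = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")          # 19 initial sounds
MEDIALS = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")        # 21 medial sounds
FINAL_CLASSES = ["", "ㄱ", "ㄴ", "ㄷ", "ㄹ", "ㅁ", "ㅂ", "ㅇ"]  # 7 finals + none

PALATALS = set("ㅈㅉㅊ")
Y_DIPHTHONGS = set("ㅑㅒㅕㅖㅛㅠ")


def is_distinct(initial: str, medial: str) -> bool:
    """Return False for pairs that the text says merge into another sound."""
    if initial in PALATALS and medial in Y_DIPHTHONGS:
        return False  # pronounced with the matching monophthong instead
    if medial == "ㅢ" and initial not in {"ㅇ", "ㄴ"}:
        return False  # pronounced [ㅣ] after a sounded initial other than ㄴ
    return True


pairs = sum(is_distinct(i, m) for i in INITIALS for m in MEDIALS)
count = pairs * len(FINAL_CLASSES)
print(count)  # prints: 2912
```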

The scope of the present invention is not limited to Hangul and may extend to applying the call-me service in the languages of various countries, such as English, Japanese, and Chinese.

Voice synthesis apparatus and method for a call-me service

FIG. 1 is a schematic diagram of a voice synthesis apparatus 1 for a call-me service according to an embodiment of the present invention. As shown in FIG. 1, the voice synthesis apparatus 1 includes a name text information receiving module 10, a name pronunciation information generating module 11, a language feature vector generating module 12, an acoustic feature vector generating module 13, a name voice information generating module 14, and a name voice information merging module 15.
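The data flow between modules 10 to 15 can be pictured as a chain of functions. The sketch below is purely illustrative: the class `CallMePipeline`, its field names, and the toy stand-in functions are hypothetical and do not implement any real synthesis; they only show the order in which each module's output feeds the next.

```python
from dataclasses import dataclass
from typing import Callable, List

Audio = List[float]  # a waveform as a plain list of samples


@dataclass
class CallMePipeline:
    to_pronunciation: Callable[[str], str]        # role of module 11
    to_language_features: Callable[[str], list]   # role of module 12
    to_acoustic_features: Callable[[list], list]  # role of module 13
    to_waveform: Callable[[list], Audio]          # role of module 14
    merge: Callable[[Audio], Audio]               # role of module 15

    def run(self, name_text: str) -> Audio:
        """name_text is what module 10 would have received."""
        pron = self.to_pronunciation(name_text)
        ling = self.to_language_features(pron)
        acoustic = self.to_acoustic_features(ling)
        name_audio = self.to_waveform(acoustic)
        return self.merge(name_audio)


# Toy stand-ins so the chain can be executed end to end.
pipeline = CallMePipeline(
    to_pronunciation=str.lower,
    to_language_features=lambda s: [ord(c) for c in s],
    to_acoustic_features=lambda v: [x / 1000 for x in v],
    to_waveform=lambda frames: list(frames),
    merge=lambda name: [0.0] + name + [0.0],  # splice into a silent "speaker" clip
)
out = pipeline.run("Ab")
```

Keeping each stage behind a plain callable makes it easy to swap a stand-in for a trained model one stage at a time.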

The name text information receiving module 10 receives name text information (a character string) about a user or a specific person from the user client 100 or the specific service server 200. Name text information means information containing the text of a surname and given name, a given name, a surname, or a given name and surname, such as '김민아', 'Serena', 'Johnson', or 'Daniel James'. On the user client 100 side, the user may enter name text information about the user or a specific person into an input field in the interface of a specific website or application, and the entered name text information may be sent to the name text information receiving module 10 over a wired or wireless network by the user client 100 or by the service server 200 of that website or application. On the specific service server 200 side, the service server 200 may, in response to a user request in a website or application it serves (a call-me service request, or a preset action such as arriving at a particular page), retrieve name text information about the user or a specific person already stored in a database connected to the service server 200 and send the retrieved name text information to the name text information receiving module 10 over a wired or wireless network. As a result, besides a UX/UI in which the user requests the call-me service by typing a name directly, a call-me video to which the name text information of a third user account has been applied, at the user's choice, can be used through SMS, messenger applications such as KakaoTalk, Line, and Facebook Messenger, and social media applications such as Facebook, Instagram, Snap, and Snow, with a UX/UI similar to that of emoticons or gifticons.

The name pronunciation information generating module 11 receives the name text information from the name text information receiving module 10 and generates name pronunciation information by converting the received name text information into a pronunciation string. Name pronunciation information means text information converted into a string that spells the name as it is pronounced, such as '김미나', 'suruna', 'joanson', or 'daeniul jaims'. FIGS. 2 and 3 show the flow by which the name pronunciation information generating module 11 generates name pronunciation information according to an embodiment of the present invention. As shown in FIGS. 2 and 3, the generation of name pronunciation information by the name pronunciation information generating module 11 may be carried out by program code comprising a phoneme decomposition step, a pronunciation rule application step, a pronunciation unification step, and a name pronunciation information generation step, and may include a pronunciation unification artificial neural network and a Hangul conversion artificial neural network.

In the phoneme decomposition step, the name pronunciation information generating module 11 generates phoneme information by decomposing the received name text information into phonemes, the smallest units in phonology. A phoneme is the smallest set of speech sounds that speakers of a language perceive as the same, and phonemes may be consonants (consonant phonemes) or vowels (vowel phonemes). When the name text information is in Hangul, the phoneme decomposition step may generate phoneme information by decomposing the name text information into the following consonant and vowel phonemes. In modern Hangul there are 14 basic consonant letters, such as ㄱ, ㄴ, ㄷ, ㄹ, and ㅁ, and 10 basic vowel letters, such as ㅏ, ㅑ, ㅓ, and ㅕ. There are 5 double consonants (ㄲ, ㄸ, ㅃ, ㅆ, ㅉ) and 11 compound vowels, such as ㅐ, ㅒ, ㅔ, and ㅖ. The final consonant (batchim) written when a syllable has a final sound is either a single batchim or a double batchim: every consonant can serve as a single batchim, and there are 13 double batchim, such as ㄲ, ㅆ, ㄳ, and ㄵ.
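For Hangul input, the decomposition step can be sketched with plain code-point arithmetic. This assumes the Unicode layout of precomposed Hangul syllables (base U+AC00, ordered by initial, medial, final), which is not part of the patent text; the function name `decompose` and the table names are hypothetical. Double batchim are emitted as a single cluster letter here, and splitting them further is left out.

```python
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
FINALS = [
    "", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ", "ㄽ", "ㄾ",
    "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ",
]


def decompose(text: str) -> list:
    """Split each precomposed Hangul syllable into its letters (jamo)."""
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 19 * 21 * 28:
            initial, rest = divmod(code, 21 * 28)
            medial, final = divmod(rest, 28)
            out.append(INITIALS[initial])
            out.append(MEDIALS[medial])
            if FINALS[final]:
                out.append(FINALS[final])
        else:
            out.append(ch)  # non-Hangul characters pass through unchanged
    return out


print(decompose("김민아"))
```

For '김민아' this yields the letters ㄱ ㅣ ㅁ, ㅁ ㅣ ㄴ, and ㅇ ㅏ in order.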

In the pronunciation rule application step, the name pronunciation information generating module 11 applies pronunciation rules to the phoneme information generated for the name text information in the phoneme decomposition step, converting part of the phoneme information into a pronunciation sequence and thereby generating pronunciation sequence information. In this step the name pronunciation information generating module 11 may convert part of the phoneme information into a pronunciation sequence on the basis of the Standard Pronunciation rules in Part 2 of the Standard Language Regulations, such as tensification, consonant assimilation, and palatalization.

The pronunciation rules that the name pronunciation information generation module 11 applies to the phoneme information may include the following.

Final-sound rule: only the seven consonants ㄱ, ㄴ, ㄷ, ㄹ, ㅁ, ㅂ, and ㅇ are used as final consonants in the pronunciation sequence, e.g. 밖 [ㅂㅏㄱ], 꽃 [ㄲㅗㄷ], 무릎 [ㅁㅜㄹㅡㅂ].

Liaison rule: when a final consonant is followed by a phoneme that begins with a vowel, the final consonant is moved to the onset of the following syllable, e.g. 밥이 [ㅂㅏㅂㅣ], 꽃을 [ㄲㅗㅊㅡㄹ].

Consonant assimilation rule 1: the final consonants ㄱ (ㄲ, ㅋ, ㄱㅅ, ㄹㄱ), ㄷ (ㅅ, ㅆ, ㅈ, ㅊ, ㅌ, ㅎ), and ㅂ (ㅍ, ㄹㅂ, ㄹㅍ, ㅂㅅ) are pronounced [ㅇ, ㄴ, ㅁ], respectively, before ㄴ or ㅁ, e.g. 먹는 [ㅁㅓㅇㄴㅡㄴ], 있는 [ㅇㅣㄴㄴㅡㄴ], 앞마당 [ㅇㅏㅁㅁㅏㄷㅏㅇ].

Consonant assimilation rule 2: ㄹ following the final consonant ㅁ or ㅇ is pronounced [ㄴ].

Consonant assimilation rule 3: ㄴ is pronounced [ㄹ] before or after ㄹ.

Palatalization rule: when the final consonant ㄷ or ㅌ (ㄹㅌ) is combined with the vowel ㅣ, it is changed to [ㅈ, ㅊ] and moved to the onset of the following syllable.

Aspiration rule: when ㅂ, ㄷ, ㅈ, or ㄱ is located before or after ㅎ, it is aspirated and pronounced [ㅍ, ㅌ, ㅊ, ㅋ].

Tensification rule: ㄱ, ㄷ, ㅂ, ㅅ, or ㅈ following the final consonants ㄱ (ㄲ, ㅋ, ㄱㅅ, ㄹㄱ), ㄷ (ㅅ, ㅆ, ㅈ, ㅊ, ㅌ, ㅎ), or ㅂ (ㅍ, ㄹㅂ, ㄹㅍ, ㅂㅅ) is pronounced as a tense consonant, and when the first sound of an ending attached after the final consonants ㄴ (ㄴㅈ), ㅁ (ㄹㅁ), or ㄹ is ㄱ, ㄷ, ㅅ, or ㅈ, it is pronounced as a tense consonant.

'ㄴ' insertion rule: when the preceding syllable ends in a consonant and the following syllable is 이, 야, 여, 요, or 유, the sound ㄴ is inserted so that the syllable is pronounced [니, 냐, 녀, 뇨, 뉴], and a ㄴ inserted after the final consonant ㄹ is pronounced [ㄹ].

'ㅅ' insertion (sai-siot) rule: when a sai-siot precedes a syllable beginning with ㄱ, ㄷ, ㅂ, ㅅ, or ㅈ, the pronunciation sequence omits the sai-siot (깃발 [기빨]); when ㄴ or ㅁ follows the sai-siot, it is pronounced [ㄴ] (콧물 [콘물]); and when 이 follows the sai-siot, it is pronounced [ㄴㄴ] (깻잎 [깬닙]).
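Two of the rules listed above, the liaison rule and the final-sound rule, can be sketched as follows. This is an illustrative simplification under stated assumptions: syllables are given as (onset, vowel, final) triples, only single final consonants are handled, exceptions such as a final ㅎ before a vowel are not modeled, and the names `FINAL_SOUND` and `apply_rules` are hypothetical.

```python
# Final-sound rule: every single final consonant is realized as one of ㄱ ㄴ ㄷ ㄹ ㅁ ㅂ ㅇ.
FINAL_SOUND = {
    "ㄱ": "ㄱ", "ㄲ": "ㄱ", "ㅋ": "ㄱ",
    "ㄴ": "ㄴ",
    "ㄷ": "ㄷ", "ㅅ": "ㄷ", "ㅆ": "ㄷ", "ㅈ": "ㄷ", "ㅊ": "ㄷ", "ㅌ": "ㄷ", "ㅎ": "ㄷ",
    "ㄹ": "ㄹ", "ㅁ": "ㅁ",
    "ㅂ": "ㅂ", "ㅍ": "ㅂ",
    "ㅇ": "ㅇ",
}


def apply_rules(syllables):
    """Apply liaison first, then the final-sound rule (simplified sketch)."""
    syls = [list(s) for s in syllables]
    # Liaison: a final consonant moves into a following empty onset (written ㅇ).
    for i in range(len(syls) - 1):
        final, next_onset = syls[i][2], syls[i + 1][0]
        if final and final != "ㅇ" and next_onset == "ㅇ":
            syls[i + 1][0] = final
            syls[i][2] = ""
    # Final-sound rule on whatever final consonants remain in place.
    for s in syls:
        if s[2]:
            s[2] = FINAL_SOUND.get(s[2], s[2])
    return "".join("".join(s) for s in syls)


print(apply_rules([("ㄲ", "ㅗ", "ㅊ")]))                    # 꽃
print(apply_rules([("ㄲ", "ㅗ", "ㅊ"), ("ㅇ", "ㅡ", "ㄹ")]))  # 꽃을
```

Applying liaison before neutralization is what lets 꽃을 keep ㅊ while 꽃 in isolation ends in ㄷ, matching the examples given for the two rules.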

The pronunciation unification step is a step in which the name pronunciation information generation module 11 unifies overlapping pronunciation sequences (similar pronunciation sequences) in the pronunciation sequence information generated in the pronunciation rule application step, thereby generating corrected pronunciation sequence information. For example, in the pronunciation unification step, pronunciation sequences such as 'ㅞ', 'ㅚ', and 'ㅙ' are unified to 'ㅙ'. As one example, when the phoneme information 'ㅇ ㅣ ㅁ ㅎ ㅖ ㅇ ㅕ ㄴ' is input, 'ㅁㅎ' is unified to 'ㅁ' and 'ㅖ' is unified to the similar pronunciation sequence 'ㅐ', producing the corrected pronunciation sequence information 'ㅇㅣㅁㅐㅇㅕㄴ'. The pronunciation unification step according to an embodiment of the present invention improves performance, such as the accuracy, of the acoustic feature vector generation module and shortens training time.
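A table-driven version of the unification step might look like the sketch below. The substitution table contains only the examples named in the description; the names `UNIFY` and `unify_pronunciation`, and the handling of 'ㅁㅎ' as a two-symbol pattern, are assumptions for illustration.

```python
# Single-symbol substitutions taken from the examples in the description.
UNIFY = {"ㅞ": "ㅙ", "ㅚ": "ㅙ", "ㅖ": "ㅐ"}


def unify_pronunciation(seq):
    """Map similar pronunciations to one representative (illustrative sketch)."""
    out = []
    i = 0
    while i < len(seq):
        # 'ㅁ' followed by 'ㅎ' is unified to 'ㅁ', as in the 'ㅁㅎ' -> 'ㅁ' example.
        if seq[i] == "ㅁ" and i + 1 < len(seq) and seq[i + 1] == "ㅎ":
            out.append("ㅁ")
            i += 2
            continue
        out.append(UNIFY.get(seq[i], seq[i]))
        i += 1
    return out


print("".join(unify_pronunciation(list("ㅇㅣㅁㅎㅖㅇㅕㄴ"))))
```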

The name pronunciation information generation module 11 according to an embodiment of the present invention may be configured to generate the corrected pronunciation sequence information using similar pronunciation sequences derived from the pronunciation distance between phonemes. The pronunciation distance between phonemes may mean the distance between the acoustic feature vectors generated by inputting each phoneme into the acoustic feature vector generation module 13 according to an embodiment of the present invention. As a result, the similar pronunciation sequences can be tailored to the speaker, because they reflect the voice characteristics of the speaker used to train the acoustic feature vector generation module 13.
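The idea of deriving similar pronunciations from distances between acoustic feature vectors can be sketched as a nearest-representative lookup. Everything concrete here is assumed: the two-dimensional toy vectors stand in for the outputs of the acoustic feature vector generation module 13, Euclidean distance and the threshold value are illustrative choices, and `similar_map` is a hypothetical helper.

```python
import math


def distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def similar_map(vectors, canonical, threshold):
    """Map each non-canonical phoneme to its nearest canonical one if close enough."""
    mapping = {}
    for phoneme, vec in vectors.items():
        if phoneme in canonical:
            continue
        best = min(canonical, key=lambda c: distance(vec, vectors[c]))
        if distance(vec, vectors[best]) <= threshold:
            mapping[phoneme] = best
    return mapping


# Toy vectors (made up for the example, not measured data).
toy_vectors = {
    "ㅙ": [1.0, 0.0],
    "ㅞ": [1.1, 0.1],
    "ㅚ": [0.9, 0.05],
    "ㅏ": [5.0, 5.0],
    "ㅗ": [3.0, 4.0],
}
print(similar_map(toy_vectors, canonical=["ㅙ", "ㅏ"], threshold=0.5))
```

With these toy values, ㅞ and ㅚ fall within the threshold of ㅙ and are mapped to it, while ㅗ stays unmapped because no representative is close enough.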

The name pronunciation information generation module 11 according to another embodiment of the present invention may include a pronunciation unification artificial neural network, which is a pre-trained sequence-to-sequence artificial neural network module that takes pronunciation sequence information as input data and produces corrected pronunciation sequence information as output data. FIG. 4 is a schematic diagram of the pronunciation unification artificial neural network of the name pronunciation information generation module according to another embodiment of the present invention. In FIG. 4, <go> is a signal marking the start of the corrected pronunciation sequence information, and <eos> is a signal marking its end. As shown in FIG. 3, the pronunciation unification artificial neural network according to an embodiment of the present invention may be configured as a recurrent neural network (RNN) including gated recurrent unit (GRU) cells or long short-term memory (LSTM) cells, with pronunciation sequence information as input data and corrected pronunciation sequence information as output data. In the training stage, the network may be trained by a method such as backpropagation based on the difference between the corrected pronunciation sequence information that the network outputs for the input pronunciation sequence information and the corrected pronunciation sequence information (ground truth) generated using the similar pronunciation sequences derived from the pronunciation distance between phonemes. In particular, the loss function of the pronunciation unification artificial neural network may be a loss function such as mean squared error (MSE) or cross entropy between the acoustic feature vector of the input pronunciation sequence information (generated by inputting the pronunciation sequence information into the acoustic feature vector generation module 13) and the acoustic feature vector of the output corrected pronunciation sequence information (generated by inputting the corrected pronunciation sequence information into the acoustic feature vector generation module 13). Accordingly, the pronunciation unification artificial neural network module is not trained to reduce the distance (difference) between the pronunciation sequence information and the corrected pronunciation sequence information as embedded by an embedding module such as Word2Vec, which embeds phonemes or words based on how phonemes or syllables are used. Instead, it is trained to reduce the distance (difference) between the two in the acoustic domain.

In addition, according to another embodiment of the present invention, the loss function of the pronunciation unification artificial neural network may include two terms. The first is a loss function such as mean squared error (MSE) or cross entropy (the word loss function) between the acoustic feature vector of the input pronunciation sequence information (generated by inputting the pronunciation sequence information into the acoustic feature vector generation module 13) and the acoustic feature vector of the output corrected pronunciation sequence information (generated by inputting the corrected pronunciation sequence information into the acoustic feature vector generation module 13). The second is a loss function such as MSE or cross entropy (the phoneme loss function) between the acoustic feature vector of a first phoneme, which is the phoneme in the pronunciation sequence information that is subject to pronunciation unification, and the acoustic feature vector of a second phoneme, which is the phoneme in the corrected pronunciation sequence information that replaced the first phoneme. Accordingly, the pronunciation unification artificial neural network module is trained in consideration of both the pronunciation similarity between the pronunciation sequence information and the corrected pronunciation sequence information and the pronunciation similarity of the substituted phonemes.
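The two-term loss described above can be written out for the MSE case as follows. This is a sketch under assumptions: the description only says that the loss includes both terms, so summing them with a `phoneme_weight` factor is an illustrative choice, and the feature vectors are plain Python lists rather than outputs of the acoustic feature vector generation module 13.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def unification_loss(seq_in_vec, seq_out_vec, first_phoneme_vec, second_phoneme_vec,
                     phoneme_weight=1.0):
    """Word loss plus weighted phoneme loss (the weighting is an assumption)."""
    word_loss = mse(seq_in_vec, seq_out_vec)                   # whole-sequence acoustic distance
    phoneme_loss = mse(first_phoneme_vec, second_phoneme_vec)  # replaced phoneme vs. replacement
    return word_loss + phoneme_weight * phoneme_loss


print(unification_loss([0.0, 0.0], [1.0, 1.0], [0.0], [2.0]))
```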

In the RNN module of the name pronunciation information generation module 11 according to another embodiment of the present invention, the hidden vector of the hidden layer in the state at the current time t may be configured to be updated as a function of the state at the previous time t-1 and the input vector, as in Equation 1 below.
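Equation 1 itself is not reproduced in this text. A form consistent with the symbol definitions that follow is sketched below; the activation function f is an assumption (tanh is a common choice for this kind of recurrent update, but the text does not state it).

```latex
h_t = f\left( W_{hh}\, h_{t-1} + W_{xh}\, x_t + b_h \right)
```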

In the above equation, h_t is the hidden vector at time t, W_hh is the weight applied to the hidden vector h_{t-1} of the previous time t-1, W_xh is the weight applied to the input vector x_t at time t, and b_h is a constant.

Accordingly, corrected pronunciation sequence information can be generated for a wide variety of name text information or pronunciation sequence information without presetting every similar pronunciation sequence rule for the name text information and pronunciation sequence information.

The name pronunciation information generation step is a step in which the name pronunciation information generation module 11 combines the corrected pronunciation sequence information from the pronunciation unification step to generate the name pronunciation information. For example, when the corrected pronunciation sequence information is 'ㅇ ㅣ ㅁ ㅐ ㅇ ㅕ ㄴ', the name pronunciation information '이매연' is generated.
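Recombining jamo into syllables is the inverse of the Unicode arithmetic used for decomposition. The sketch below assumes well-formed input in which every syllable has an onset and a vowel, and treats a consonant as a final consonant only when it is not immediately followed by a vowel; `compose_hangul` is a hypothetical name.

```python
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def compose_hangul(jamo):
    """Combine a flat jamo sequence into Hangul syllables (assumes well-formed input)."""
    out = []
    i = 0
    while i < len(jamo):
        onset, vowel = jamo[i], jamo[i + 1]
        i += 2
        final = ""
        # The next consonant is a final only if it does not start the next syllable.
        if i < len(jamo) and jamo[i] in JONGSEONG:
            if i + 1 == len(jamo) or jamo[i + 1] not in JUNGSEONG:
                final = jamo[i]
                i += 1
        code = 0xAC00 + (CHOSEONG.index(onset) * 21 + JUNGSEONG.index(vowel)) * 28
        out.append(chr(code + JONGSEONG.index(final)))
    return "".join(out)


print(compose_hangul(list("ㅇㅣㅁㅐㅇㅕㄴ")))
```

Applied to the corrected pronunciation sequence from the example, this produces '이매연'.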

The name pronunciation information generation module 11 according to an embodiment of the present invention may be configured to generate the name pronunciation information through the following steps, based on pre-stored pronunciation rules.

① [Phoneme decomposition step] Decompose the received name text information into jamo, the phoneme units. For example, '김민아' -> 'ㄱ ㅣ ㅁ ....'

② [Pronunciation rule application step] Convert to a pronunciation sequence based on the pre-stored pronunciation rules, for example the pronunciation rules of the Standard Pronunciation Rules in Part 2 of the Standard Language Regulations, such as tensification ('이국비' [이국삐], '김국증' [김국쯩]), consonant assimilation, and palatalization.

③ [Pronunciation unification step] Unify pronunciations that differ little when spoken. For example, the pronunciations 'ㅞ/ㅚ' are unified to the pronunciation 'ㅙ'.

④ [Name pronunciation information generation step] Combine the text information decomposed into jamo to generate the name pronunciation information.

Accordingly, the computations of the artificial neural networks included in the language feature vector generation module 12, the acoustic feature vector generation module 13, and the name speech information generation module 14 are optimized, so that name speech generation performance improves, training time decreases, and less training data is needed to reach a given level of name speech generation performance.

The name pronunciation information generation module 11 according to another embodiment of the present invention may be configured to include a pre-trained artificial neural network module with an RNN structure (LSTM cells, GRU cells, or the like) that takes name text information as input data and produces name pronunciation information as output data.

When the name text information is in a foreign language such as English, the name pronunciation information generation module 11 may be configured to additionally perform, before the phoneme decomposition step, a Hangul conversion step that converts the foreign-language name text information into Hangul name text information. The Hangul conversion step converts the foreign-language name text information into a Hangul pronunciation sequence according to the Loanword Orthography to generate Hangul name text information, and the steps from the phoneme decomposition step onward are then performed on the generated Hangul name text information. Accordingly, the name pronunciation information generation module 11 does not need to be configured separately for each foreign language. Because no separate name pronunciation information generation module 11 is configured for a specific foreign language, the increase in module complexity and the loss of performance that the differing pronunciation rules of each language would otherwise cause are avoided.

In particular, for a language whose spelling poorly reflects its pronunciation, such as English, a Hangul conversion artificial neural network may be included in the name pronunciation information generation module 11. The Hangul conversion artificial neural network may be configured as a recurrent neural network (RNN) including gated recurrent unit (GRU) cells or long short-term memory (LSTM) cells, with foreign-language name text information as input data and Hangul name text information as output data. In the training stage, the network may be trained by a method such as backpropagation based on the difference between the Hangul name text information that the network outputs for the input foreign-language name text information and the Hangul name text information (ground truth) generated according to the Loanword Orthography. Accordingly, foreign-language name text information can be converted into Hangul name text information even for such a language. For example, although English 'ough' has many pronunciations, as in 'though' [도], 'through' [스루], 'rough' [러프], 'cough' [코프], 'thought' [소트], and 'bough' [바우], English text can still be turned into name pronunciation information composed of a Hangul pronunciation sequence. In addition, an English pronunciation sequence can be output without configuring complex pronunciation rules in the name pronunciation information generation module 11, such as the vowel nasalization, vowel raising, vowel lowering, vowel devoicing, and schwa coalescence rules for English vowel phonemes, and the aspiration, velarization, syllabic consonant, unreleased stop, dentalization, flapping, tensification, labialization, devoicing, voicing, and labiodentalization rules for English consonant phonemes.

The language feature vector generation module 12 is an encoding module that receives the name pronunciation information generated by the name pronunciation information generation module 11 and, based on it, generates a language feature vector for the text of the name pronunciation information. FIG. 5 is a schematic diagram illustrating the generation of the language feature vector by the language feature vector generation module 12. As shown in FIG. 5, the language feature vector generation module 12 according to an embodiment of the present invention may be configured to generate the language feature vector through the following steps.

① Decompose the name pronunciation information into jamo and embed each jamo to generate jamo vectors, which are preset character embedding data for each jamo.

② Input the jamo vectors from step ① into a pre-trained artificial neural network module (embedding artificial neural network) that takes jamo vectors as input data and outputs language feature vectors, which are the text embedding data for each jamo, that is, the vectors that best represent the name pronunciation information.

③ Output the language feature vector, which is the text embedding data for the name pronunciation information.

According to an embodiment of the present invention, the pre-trained artificial neural network module (embedding artificial neural network) of step ② may be configured by connecting a Pre-net composed of n FC layers in an FC (fully connected layer)-ReLU-Dropout-FC-ReLU-Dropout structure, an n-D convolution layer, and a bidirectional RNN module (which outputs the combination of the forward sequence and the backward sequence), so that the language feature vector, which is the text embedding data for the name pronunciation information, is output.

Here, the Pre-net is a two-layer fully connected network with dropout applied, which prevents overfitting and helps training converge.

In particular, the n-D convolution layer may be configured to convolve the name pronunciation information with filters whose lengths range from unigram to K-gram using a 1D convolution bank, and to stack the results. Max pooling is then applied to improve local invariance, with stride = 1 to preserve the resolution along the time axis. To extract higher-level features, several layers of 1D convolution are then computed and a residual connection is applied, so that the sum of the 1D convolution result and the name pronunciation information passes through a highway network and is input to the bidirectional RNN module. Here, the highway network may be structured to compute a weighted sum of the output of one neural network layer and the original input, the name pronunciation information.
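The weighted sum performed by a highway layer can be sketched in its commonly used gated form, y = T(x) * H(x) + (1 - T(x)) * x, where H is one transformed layer and T is a sigmoid gate. Whether the patent's network uses exactly this gating is an assumption; the ReLU transform, the weight shapes, and the helper names below are illustrative.

```python
import math


def affine(weights, x, bias):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """One highway layer: gate t mixes the transformed signal h with the raw input x."""
    h = [max(0.0, v) for v in affine(w_h, x, b_h)]                  # transformed branch (ReLU)
    t = [1.0 / (1.0 + math.exp(-v)) for v in affine(w_t, x, b_t)]   # transform gate in (0, 1)
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


x = [1.0, 2.0]
w_h = [[2.0, 0.0], [0.0, 2.0]]   # transform doubles each input
w_t = [[0.0, 0.0], [0.0, 0.0]]   # zero gate weights and bias give t = 0.5 everywhere
print(highway_layer(x, w_h, [0.0, 0.0], w_t, [0.0, 0.0]))
```

With the gate fixed at 0.5, each output is the average of the doubled value and the original input; a strongly negative gate bias would instead pass the input through almost unchanged.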

The acoustic feature vector generation module 13 is a decoding module that receives the language feature vector from the language feature vector generation module 12 and, based on it, generates an acoustic feature vector relating to acoustic characteristics (time, frequency, amplitude), such as spectrogram information. More specifically, the acoustic feature vector generation module 13 may be configured to receive the acoustic feature vector of the frame at a given time step as input, and to output the acoustic feature vector of the frame at the next time step based on the concatenation of the context vector, which the context module generates from the language feature vector, and the hidden state vector of the context artificial neural network module. In particular, the acoustic feature vector generation module 13 according to an embodiment of the present invention can reduce training time, synthesis time, and model size by outputting acoustic feature vectors for several frames per time step rather than one.

In the training session, the acoustic feature vector generation module 13 may be trained using, as input information, the language feature vector for the speech information or video information of the specific speaker to be merged in the name speech information merging module 15, and using the speech information of that specific speaker as the output information and ground truth.

FIG. 6 is a schematic diagram illustrating the generation of the acoustic feature vector by the acoustic feature vector generation module 13 according to an embodiment of the present invention. As shown in FIG. 6, the acoustic feature vector generation module 13 according to an embodiment of the present invention may include a context module, a context artificial neural network module (RNN), a decoder artificial neural network module (RNN), and a mel spectrogram artificial neural network module, and may be configured to generate the acoustic feature vector through the following steps.

① The context module receives the language feature vector, the hidden state vector of the context artificial neural network module (whose input vector is at least part of the spectrogram information of the acoustic feature vector of the previous time step), and the output vector of the context artificial neural network module, and outputs a context vector.

② The decoder artificial neural network module receives, in sequence form, input vectors formed by concatenating the context vector and the language feature vector, and outputs acoustic feature vectors in sequence form. In particular, the decoder artificial neural network module is a recurrent neural network (RNN): once the sequence-form input vector has been input, at least part of the spectrogram information of the acoustic feature vector of the previous time step (<go> when the previous time step is t = initial) is input as the input vector, and the acoustic feature vector of the current time step is output.

③ The mel spectrogram artificial neural network module (CNN, convolutional neural network) receives the acoustic feature vector and generates a mel spectrogram, which is the vocoder input information for that acoustic feature vector.

Here, for example, the context artificial neural network module according to an embodiment of the present invention may consist of one layer of 256-unit GRU, and the decoder artificial neural network module may consist of two layers of 256-unit GRU with residual connections for faster convergence of the model.

The context vector according to an embodiment of the present invention is a sequence-form weight vector over part of the sequence of language feature vectors input to the decoder artificial neural network module. It determines which part of the sequence of language feature vectors mainly drives the acoustic feature vector that the decoder artificial neural network module outputs at the current time step, so that acoustic feature vectors can be output stably even for name text information that was not seen during training.
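How a weight vector over the language feature sequence selects the part that drives the current output can be sketched with a simple attention computation. The dot-product scoring used here is an assumption chosen for brevity (the description does not specify the scoring function), and `attend`, `encoder_states`, and `query` are hypothetical names.

```python
import math


def attend(encoder_states, query):
    """Return softmax weights over the sequence and the weighted-sum context."""
    # Score each language feature vector against the current decoder-side state.
    scores = [sum(e * q for e, q in zip(state, query)) for state in encoder_states]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]   # subtract the max for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context


weights, context = attend([[1.0, 0.0], [0.0, 1.0]], query=[10.0, 0.0])
print(weights, context)
```

In this toy call the query aligns with the first position, so nearly all of the weight falls on the first language feature vector.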

The mel spectrogram artificial neural network module according to an embodiment of the present invention is an artificial neural network module composed of an n-layer ConvNet (CNN, convolutional neural network) that takes the acoustic feature vector as input data and outputs the mel spectrogram (vocoder input information) of that acoustic feature vector.

For convenience of explanation, the present invention is described in terms of the mel spectrogram, a spectrogram on the mel scale, but the scope of the present invention is not limited thereto. The vocoder input information according to an embodiment of the present invention is not limited to the mel spectrogram and may include a basic spectrogram that has not passed through a mel filterbank, a spectrum, frequency information obtained using the Fourier transform such as f0 (the fundamental frequency), and aperiodicity, which denotes the ratio between the aperiodic components of a signal and the speech signal.

The name voice information generation module 14 is a vocoder module that receives the mel spectrogram, which is the vocoder input information for the acoustic characteristic vector output from the decoder artificial neural network, and inputs that mel spectrogram (for example, an 80-band mel-scale spectrogram) to a speech generation module such as Griffin-Lim or WaveNet to generate name voice information (for example, a 1025-dimensional linear-scale spectrogram). That is, the name voice information generation module 14 is a post-processing network for converting the mel-scale acoustic characteristic vector to a linear scale.
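The relationship between the 80-band mel-scale representation and the 1025-bin linear-scale representation mentioned above can be sketched as follows. This is only an illustration of the two scales, not the vocoder of the embodiment: the sample rate (22050 Hz), the FFT size (2048, which yields 1025 bins), the HTK-style mel formula, and the pseudo-inverse used for the approximate mel-to-linear mapping are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr=22050, n_fft=2048, n_mels=80):
    """Triangular mel filterbank of shape (n_mels, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1                      # 1025 bins for n_fft = 2048
    bin_freqs = np.linspace(0.0, sr / 2.0, n_bins)
    # n_mels + 2 band edges, equally spaced on the mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def linear_to_mel(linear_mag, fb):
    """Project a (1025, frames) magnitude spectrogram onto (80, frames) mel bands."""
    return fb @ linear_mag

def mel_to_linear_approx(mel_mag, fb):
    """Approximate inverse via the pseudo-inverse, clipped to be non-negative."""
    return np.maximum(0.0, np.linalg.pinv(fb) @ mel_mag)

fb = mel_filterbank()
linear = np.abs(np.random.default_rng(0).normal(size=(1025, 4)))  # dummy frames
mel = linear_to_mel(linear, fb)
recovered = mel_to_linear_approx(mel, fb)
```

Because 80 bands cannot carry all the information in 1025 bins, the recovered linear spectrogram is only an approximation; a learned post-processing network or a neural vocoder would be expected to do better than this linear stand-in.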

The name voice information merging module 15 is a module that generates merged content information by inserting the name voice information generated by the name voice information generation module 14 into the name region of a specific speaker's voice information or video information (the merge target content information) in which a name region is preset, thereby merging the name voice information with that speaker's voice information or video information. FIG. 7 is a schematic diagram illustrating the merging of name voice information by the name voice information merging module 15 according to an embodiment of the present invention. The name region according to an embodiment of the present invention may consist of a start time and an end time of the region, within the specific speaker's voice information or video information, into which the name voice information is to be inserted.
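As an illustration of a name region defined by a start time and an end time, the following minimal sketch splices a synthesized name waveform into a mono audio track. The function name, the sample-array representation, and the choice to replace the region outright (so that the output length may differ from the input) are assumptions; the specification does not state how a length mismatch between the region and the name voice information is handled.

```python
import numpy as np

def merge_name_audio(content, name_audio, start_s, end_s, sample_rate):
    """Replace the name region [start_s, end_s) of `content` with `name_audio`.

    content, name_audio: 1-D arrays of mono samples at the same sample rate
    start_s, end_s:      name region boundaries in seconds
    Returns a new array; the inputs are not modified.
    """
    start = int(round(start_s * sample_rate))
    end = int(round(end_s * sample_rate))
    if not 0 <= start <= end <= len(content):
        raise ValueError("name region lies outside the content")
    return np.concatenate([content[:start], name_audio, content[end:]])

# Hypothetical example: 3 s of content at 16 kHz, name region from 1.0 s to 1.5 s.
sr = 16000
content = np.zeros(3 * sr)
name_audio = np.ones(int(0.4 * sr))   # stand-in for the synthesized name
merged = merge_name_audio(content, name_audio, 1.0, 1.5, sr)
```

For video content, the same region boundaries would apply to the audio track, and a real system would likely also need to time-stretch or pad the name audio so that the picture stays in sync.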

According to an embodiment of the present invention, when a user or a server inputs name text information, customized voice information or video information can be generated in which a specific speaker calls or mentions the input name in that speaker's own voice (the call-me service).

As described above, those skilled in the art to which the present invention pertains will understand that the present invention can be implemented in other specific forms without changing its technical spirit or essential features. Therefore, the above-described embodiments are to be understood in all respects as illustrative and not restrictive. The scope of the present invention is indicated by the following claims rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as being included in the scope of the present invention.

The features and advantages described herein are not all-inclusive, and many additional features and advantages will become apparent to those skilled in the art upon consideration of the drawings, the specification, and the claims. Moreover, it should be noted that the language used herein has been selected primarily for readability and instructional purposes, and may not have been selected to delineate or limit the subject matter of the present invention.

The foregoing description of embodiments of the present invention has been presented for purposes of illustration. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Those skilled in the art will appreciate that many modifications and variations are possible in light of the above disclosure.

Therefore, the scope of the present invention is limited not by the detailed description but by the claims of any application based on it. Accordingly, the disclosure of the embodiments of the present invention is illustrative and is not intended to limit the scope of the present invention as set forth in the following claims.

1: Voice synthesis apparatus for call-me service
10: Name text information receiving module
11: Name pronunciation information generation module
12: Language feature vector generation module
13: Acoustic characteristic vector generation module
14: Name voice information generation module
15: Name voice information merging module
100: User client
200: Service server

Claims

1. A voice synthesis apparatus for a call-me service using a language feature vector, comprising:
a name pronunciation information generation module that generates name pronunciation information by converting name text information into a pronunciation string;
a language feature vector generation module that receives the name pronunciation information generated by the name pronunciation information generation module and generates a language feature vector for the text of the name pronunciation information based on the name pronunciation information;
an acoustic characteristic vector generation module that receives the language feature vector from the language feature vector generation module and generates an acoustic characteristic vector based on the language feature vector;
a name voice information generation module that receives the acoustic characteristic vector output from the acoustic characteristic vector generation module and generates name voice information using the acoustic characteristic vector; and
a name voice information merging module that inserts the name voice information generated by the name voice information generation module into a name region of merge target content information, which is voice information or video information of a specific speaker in which the name region is preset, thereby merging the name voice information with the merge target content information of the specific speaker and generating merged content information,
wherein the name pronunciation information generation module generates the name pronunciation information by converting the name text information into a pronunciation string and then unifying similar pronunciation sequences for each phoneme,
wherein the language feature vector generation module generates phoneme information by decomposing the name pronunciation information into jamo, embeds the phoneme information to generate jamo vectors, which are character embedding information for each jamo, and includes an artificial neural network module that takes the jamo vectors as input data and the language feature vector, which is text embedding information for the name pronunciation information, as output data, the jamo vectors being input to the artificial neural network module to output the language feature vector, and
wherein the apparatus is configured to provide a call-me service that generates and outputs the merged content information so that the name text information is called based on the voice characteristics of the specific speaker of the merge target content information.

2. The voice synthesis apparatus of claim 1,
wherein the artificial neural network module of the language feature vector generation module comprises a pre-net including a plurality of fully connected layers, a convolution layer, and a bidirectional RNN module.

3. A voice synthesis method for a call-me service using a language feature vector, comprising:
a name pronunciation information generation step in which a name pronunciation information generation module generates name pronunciation information by converting name text information into a pronunciation string;
a language feature vector generation step in which a language feature vector generation module receives the name pronunciation information generated by the name pronunciation information generation module and generates a language feature vector for the text of the name pronunciation information based on the name pronunciation information;
an acoustic characteristic vector generation step in which an acoustic characteristic vector generation module receives the language feature vector from the language feature vector generation module and generates an acoustic characteristic vector relating to spectrogram information based on the language feature vector;
a name voice information generation step in which a name voice information generation module receives the acoustic characteristic vector output from the acoustic characteristic vector generation module and generates name voice information using the acoustic characteristic vector; and
a name voice information merging step in which a name voice information merging module inserts the name voice information generated by the name voice information generation module into a name region of merge target content information, which is voice information or video information of a specific speaker in which the name region is preset, thereby merging the name voice information with the merge target content information of the specific speaker and generating merged content information,
wherein, in the name pronunciation information generation step, the name pronunciation information generation module generates the name pronunciation information by converting the name text information into a pronunciation string and then unifying similar pronunciation sequences for each phoneme,
wherein the language feature vector generation module includes an artificial neural network module that takes jamo vectors as input data and the language feature vector, which is text embedding information for the name pronunciation information, as output data,
wherein, in the language feature vector generation step, the language feature vector generation module generates phoneme information by decomposing the name pronunciation information into jamo, embeds the phoneme information to generate the jamo vectors, which are character embedding information for each jamo, and inputs the jamo vectors to the artificial neural network module to output the language feature vector, and
wherein the method provides a call-me service that generates and outputs the merged content information so that the name text information is called based on the voice characteristics of the specific speaker of the merge target content information.

4. The voice synthesis method of claim 3,
wherein, in the language feature vector generation step, the artificial neural network module of the language feature vector generation module comprises a pre-net including a plurality of fully connected layers, a convolution layer, and a bidirectional RNN module.