KR102222597B1 - Voice synthesis apparatus and method for 'call me' service

Info

Publication number: KR102222597B1
Application number: KR1020200037641A
Authority: KR
Inventors: 정승환
Original assignee: (주)라이언로켓
Priority date: 2020-02-03
Filing date: 2020-03-27
Publication date: 2021-03-05
Anticipated expiration: 2040-03-27
Also published as: KR20210122071A; KR20210122069A; KR102401243B1; KR20210122070A; KR102352986B1; KR102352987B1

Abstract

The present invention relates to a voice synthesis apparatus for a call-me service and method thereof. The voice synthesis apparatus for a call-me service may comprise a name pronunciation information generating module receiving name text information from a name text information receiving module and generating name pronunciation information in which the received name text information is converted into a pronunciation string form. According to one embodiment of the present invention, there is an effect of enabling a service for calling a user with a voice of a speaker with respect to voices/videos of various speakers.

Description

Voice synthesis apparatus and method for 'call me' service

The present invention relates to a voice synthesis apparatus and method for a call-me service.

Speech synthesis is the artificial production of human speech. Text-to-speech (TTS), which converts text into speech, is used as a representative speech synthesis method.

Earlier TTS algorithms mainly used the unit selection approach, which extracts linguistic features from the input text and produces output by concatenating waveform segments. The key issue was how well audio segments could be matched against an existing waveform dictionary. The statistical parametric approach appeared later; it added a module that probabilistically converts the extracted linguistic features into acoustic features, but sound-quality and noise problems made it difficult to commercialize.

Since the advent of deep learning, end-to-end approaches have developed, in contrast to earlier TTS; representative algorithms include WaveNet, Tacotron, and Tacotron 2. These deep learning algorithms include an encoding step that generates embedding vectors of syllables or phonemes (linguistic features), a decoding step that probabilistically converts the linguistic features into acoustic features, and a vocoder step that converts the acoustic features into a waveform. Deep learning models such as RNNs and LSTMs can be used in each step.

Republic of Korea Registered Patent No. 10-1716013, "Speech synthesis processing of content for multiple languages", Apple Inc.

However, these TTS services merely output, from text, speech having particular voice characteristics; no service has yet emerged that resonates emotionally with users and gives them a sense of human connection.

Accordingly, an object of the present invention is to provide a voice synthesis apparatus and method for a call-me service with which a user can feel a deep connection, by synthesizing speech of the user's name into the voice or video of various speakers.

Specific means for achieving the object of the present invention are described below.

The object of the present invention can be achieved by providing a voice synthesis apparatus for a call-me service, comprising: a name text information receiving module that receives name text information for a user or a third party from a user client 100 or a specific service server 200; a name pronunciation information generating module that receives the name text information from the name text information receiving module and generates name pronunciation information in which the received name text information is converted into a pronunciation string; a linguistic feature vector generating module that receives the name pronunciation information generated by the name pronunciation information generating module and, based on the name pronunciation information, generates a linguistic feature vector for the text of the name pronunciation information; an acoustic feature vector generating module that receives the linguistic feature vector from the linguistic feature vector generating module and, based on the linguistic feature vector, generates an acoustic feature vector containing n pieces of spectrogram information; a name voice information generating module that receives the acoustic feature vector output from the acoustic feature vector generating module and inputs the acoustic feature vector to a Griffin-Lim module to generate name voice information; and a name voice information merging module that inserts the name voice information generated by the name voice information generating module into the name region of a specific speaker's audio information or video information in which the name region is preset, thereby merging the name voice information with the audio information or video information of the specific speaker.
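As a non-limiting illustration of the merging module, the sketch below splices synthesized name audio into a preset name region of a speaker's recording. It assumes that audio is a plain list of samples, that the name region is a `(start, end)` pair of sample indices, and that the region's original samples are replaced; the function name `merge_name_audio` and all of these choices are hypothetical and are not specified by the description.

```python
from typing import List, Tuple


def merge_name_audio(
    base: List[float],
    name_region: Tuple[int, int],
    name_audio: List[float],
) -> List[float]:
    """Return base audio with base[start:end] replaced by name_audio.

    Assumption: the preset name region is a half-open sample range.
    """
    start, end = name_region
    if not (0 <= start <= end <= len(base)):
        raise ValueError("name region must lie inside the base audio")
    # Keep everything before and after the region; drop the region itself.
    return base[:start] + name_audio + base[end:]


speaker_clip = [0.0, 0.1, 0.2, 0.3, 0.4]
merged = merge_name_audio(speaker_clip, (1, 3), [9.0, 9.0, 9.0])
print(merged)
```

Because the name audio may be longer or shorter than the region it replaces, the merged clip can change length; a real system would also need to keep any accompanying video in sync, which this sketch does not attempt.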

Another object of the present invention can be achieved by providing a voice synthesis apparatus for a call-me service, comprising: a name text information receiving module that receives name text information for a user or a third party from a user client or a specific service server; a name pronunciation information generating module that receives the name text information from the name text information receiving module and generates name pronunciation information in which the received name text information is converted into a pronunciation string; a linguistic feature vector generating module that receives the name pronunciation information generated by the name pronunciation information generating module and, based on the name pronunciation information, generates a linguistic feature vector for the text of the name pronunciation information; an acoustic feature vector generating module that receives the linguistic feature vector from the linguistic feature vector generating module and, based on the linguistic feature vector, generates an acoustic feature vector relating to spectrogram information; a name voice information generating module that receives the acoustic feature vector output from the acoustic feature vector generating module and generates name voice information using the acoustic feature vector; and a name voice information merging module that inserts the name voice information generated by the name voice information generating module into the name region of merge target content information, which is audio information or video information of a specific speaker in which the name region is preset, thereby merging the name voice information with the merge target content information of the specific speaker and generating merged content information; wherein the apparatus provides a call-me service that generates the merged content information so that the merge target content information is output with the name text information called out in the voice characteristics of the specific speaker.

In addition, the acoustic feature vector generating module may be an artificial neural network that takes as input a context vector generated from the linguistic feature vector together with the acoustic feature vector of a specific frame, and outputs the acoustic feature vector of the frame following that specific frame.
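The frame-by-frame generation described here can be sketched as an autoregressive loop. The following is a minimal illustration only: `toy_step` is a hypothetical stand-in for the trained network, the all-zero starting frame is an assumption, and a fixed frame count replaces whatever stopping criterion an actual embodiment would use.

```python
from typing import Callable, List, Sequence

Frame = List[float]


def generate_frames(
    context: Sequence[float],
    step: Callable[[Sequence[float], Frame], Frame],
    n_frames: int,
    frame_dim: int,
) -> List[Frame]:
    """Produce n_frames acoustic frames, feeding each output back as input."""
    prev: Frame = [0.0] * frame_dim  # assumed all-zero start frame
    frames: List[Frame] = []
    for _ in range(n_frames):
        nxt = step(context, prev)  # (context vector, frame t) -> frame t+1
        frames.append(nxt)
        prev = nxt
    return frames


def toy_step(context: Sequence[float], prev: Frame) -> Frame:
    """Hypothetical stand-in for the network: shift prev by the context mean."""
    bias = sum(context) / len(context)
    return [p + bias for p in prev]


frames = generate_frames([1.0, 3.0], toy_step, n_frames=3, frame_dim=2)
print(frames)
```

Each iteration consumes the previous output, which is why the frames must be produced sequentially rather than in parallel.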

In addition, the name pronunciation information generating module may convert the name text information into a pronunciation string and then unify similar pronunciation sequences to generate the name pronunciation information.

In addition, when the name text information is written in a foreign language, that is, a language other than Korean, the name pronunciation information generating module may first convert the foreign-language name text information into Hangul name text information and then convert that name text information into a pronunciation string to generate the name pronunciation information.

Another object of the present invention can be achieved by providing a voice synthesis method for a call-me service, comprising: a name text information receiving step in which a name text information receiving module receives name text information for a user or a third party from a user client or a specific service server; a name pronunciation information generating step in which a name pronunciation information generating module receives the name text information from the name text information receiving module and generates name pronunciation information in which the received name text information is converted into a pronunciation string; a linguistic feature vector generating step in which a linguistic feature vector generating module receives the name pronunciation information generated by the name pronunciation information generating module and, based on the name pronunciation information, generates a linguistic feature vector for the text of the name pronunciation information; an acoustic feature vector generating step in which an acoustic feature vector generating module receives the linguistic feature vector from the linguistic feature vector generating module and, based on the linguistic feature vector, generates an acoustic feature vector relating to spectrogram information; a name voice information generating step in which a name voice information generating module receives the acoustic feature vector output from the acoustic feature vector generating module and generates name voice information using the acoustic feature vector; and a name voice information merging step in which a name voice information merging module inserts the name voice information generated by the name voice information generating module into the name region of merge target content information, which is audio information or video information of a specific speaker in which the name region is preset, thereby merging the name voice information with the merge target content information of the specific speaker and generating merged content information; wherein the method provides a call-me service that generates the merged content information so that the merge target content information is output with the name text information called out in the voice characteristics of the specific speaker.

In addition, in the name pronunciation information generating step, the name pronunciation information generating module may convert the name text information into a pronunciation string and then unify similar pronunciation sequences to generate the name pronunciation information.

In addition, when the name text information is written in a foreign language, that is, a language other than Korean, in the name pronunciation information generating step the name pronunciation information generating module may first convert the foreign-language name text information into Hangul name text information and then convert that name text information into a pronunciation string to generate the name pronunciation information.

As described above, the present invention has the following effects.

First, according to an embodiment of the present invention, a service that calls the user by name in a speaker's own voice becomes possible for the voices and videos of various speakers.

Second, according to an embodiment of the present invention, pronunciation rules do not have to be preset separately for each of various languages.

Third, according to an embodiment of the present invention, unifying pronunciations and converting foreign-language names into Hangul improves the accuracy of the speech synthesis engine and reduces the amount of training data required.

The following drawings attached to this specification illustrate preferred embodiments of the present invention and, together with the detailed description, serve to aid understanding of the technical idea of the invention; the invention should therefore not be construed as limited to the matters shown in the drawings.
FIG. 1 is a schematic diagram showing a voice synthesis apparatus for a call-me service according to an embodiment of the present invention.
FIG. 2 and FIG. 3 are schematic diagrams showing flowcharts in which the name pronunciation information generating module generates name pronunciation information according to an embodiment of the present invention.
FIG. 4 is a schematic diagram showing the pronunciation unification artificial neural network of the name pronunciation information generating module according to another embodiment of the present invention.
FIG. 5 is a schematic diagram showing generation of the linguistic feature vector by the linguistic feature vector generating module.
FIG. 6 is a schematic diagram showing generation of the acoustic feature vector by the acoustic feature vector generating module according to an embodiment of the present invention.
FIG. 7 is a schematic diagram showing merging of name voice information by the name voice information merging module according to an embodiment of the present invention.

Hereinafter, embodiments that a person of ordinary skill in the art to which the present invention pertains can readily practice are described in detail with reference to the accompanying drawings. However, in describing the operating principles of the preferred embodiments in detail, a detailed description of a related known function or configuration is omitted where it is judged that it could unnecessarily obscure the gist of the present invention.

In addition, the same reference numerals are used throughout the drawings for parts having similar functions and operations. Throughout the specification, when a part is said to be connected to another part, this includes not only the case in which it is directly connected but also the case in which it is indirectly connected with another element in between. Likewise, stating that a part includes a certain component means that it may further include other components, not that other components are excluded, unless specifically stated otherwise.

In the following description, Hangul is a phonemic script made up of consonants and vowels. A syllable consists of three phoneme positions: an initial sound (choseong), a medial sound (jungseong), and a final sound (jongseong). Consonants are used for the initial and final sounds, and vowels for the medial sound.

In modern Hangul there are 14 consonants that produce a single sound (ㄱ, ㄴ, ㄷ, ㄹ, ㅁ, and so on) and 10 simple vowels (ㅏ, ㅑ, ㅓ, ㅕ, and so on). There are 5 double consonants that produce a compound sound, ㄲ, ㄸ, ㅃ, ㅆ, and ㅉ, and 11 compound vowels, such as ㅐ, ㅒ, ㅔ, and ㅖ. The final consonant (batchim) used when a syllable has a final sound is either a single batchim or a double batchim; every consonant is used as a single batchim, and there are 13 double batchim, such as ㄲ, ㅆ, ㄳ, and ㄵ.

Modern Hangul can compose 11,172 syllable blocks from its letters (19 initial sounds × 21 medial sounds × (27 final sounds + 1 for no final sound)). Of these 11,172 blocks, 399 have no final consonant and 10,773 have one.
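The counts in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the figures stated above; the cross-check against the Unicode Hangul Syllables block (U+AC00 to U+D7A3) is an addition for illustration and is not part of the original description.

```python
# Counts taken from the description above.
INITIALS = 19  # initial sounds (choseong)
MEDIALS = 21   # medial sounds (jungseong)
FINALS = 27    # final sounds (jongseong), not counting "no final sound"

total = INITIALS * MEDIALS * (FINALS + 1)
no_final = INITIALS * MEDIALS
with_final = INITIALS * MEDIALS * FINALS

# Cross-check: size of the Unicode Hangul Syllables block, U+AC00..U+D7A3.
unicode_block_size = 0xD7A3 - 0xAC00 + 1

print(total, no_final, with_final, unicode_block_size)
```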

Under the language regulations, the number of syllables actually used in modern standard Korean is smaller than this. A Korean syllable sound consists of an initial sound plus a medial sound (plus an optional final sound). In the standard language all 19 consonants can appear as the initial sound, although ㅇ in initial position is silent. Under the seven-final-consonant rule, final sounds reduce to 7 distinct sounds, or 8 cases counting the absence of a final sound, so modern Korean pronunciation has 19 initial × 21 medial × 8 final = 3,192 sounds. Under the Standard Pronunciation rules, the diphthongs ㅑ, ㅒ, ㅕ, ㅖ, ㅛ, ㅠ after the palatal consonants ㅈ, ㅉ, ㅊ are pronounced as the monophthongs ㅏ, ㅐ, ㅓ, ㅔ, ㅗ, ㅜ, which removes 3 initial × 6 medial × 8 final = 144 sounds. In addition, the medial sound [ㅢ] following a pronounced initial consonant (that is, an initial other than ㅇ) is pronounced [ㅣ], except after ㄴ (where it is recognized as a different phoneme owing to palatalization) (Hangul Orthography, Article 9, and Standard Pronunciation, Article 5, proviso 3), which removes a further 17 initial × 1 medial × 8 final = 136 sounds. The number of syllables actually used in standard Korean today is therefore 3,192 - 144 - 136 = 2,912.
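The reduction from 3,192 to 2,912 can be reproduced step by step. This is only a restatement of the arithmetic in the paragraph above; the variable names are illustrative.

```python
FINAL_CASES = 8  # 7 representative final sounds + "no final sound"

# Upper bound: every initial x medial x final-sound case.
upper_bound = 19 * 21 * FINAL_CASES

# Diphthongs after the 3 palatal initials collapse to 6 monophthongs.
palatal_merged = 3 * 6 * FINAL_CASES

# [ㅢ] after the 17 pronounced initials (19 minus ㅇ and ㄴ) is pronounced [ㅣ].
ui_merged = 17 * 1 * FINAL_CASES

distinct_syllables = upper_bound - palatal_merged - ui_merged
print(upper_bound, palatal_merged, ui_merged, distinct_syllables)
```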

The scope of the present invention is not limited to Korean; it may also cover application of the call-me service in the languages of various countries, such as English, Japanese, and Chinese.

Voice synthesis apparatus and method for a call-me service

FIG. 1 is a schematic diagram showing a voice synthesis apparatus 1 for a call-me service according to an embodiment of the present invention. As shown in FIG. 1, the voice synthesis apparatus 1 for a call-me service according to an embodiment of the present invention includes a name text information receiving module 10, a name pronunciation information generating module 11, a linguistic feature vector generating module 12, an acoustic feature vector generating module 13, a name voice information generating module 14, and a name voice information merging module 15.
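The order in which data passes through modules 11 to 15 can be sketched as a simple function chain. The code below is a hypothetical skeleton only: the function name `run_call_me_pipeline`, its parameters, and the tagging placeholders are assumptions made to show the data flow, not the structure of an actual embodiment.

```python
from typing import Any, Callable


def run_call_me_pipeline(
    name_text: str,                          # received by module 10
    to_pronunciation: Callable[[str], Any],  # module 11
    to_linguistic: Callable[[Any], Any],     # module 12
    to_acoustic: Callable[[Any], Any],       # module 13
    to_name_voice: Callable[[Any], Any],     # module 14
    merge: Callable[[Any], Any],             # module 15
) -> Any:
    """Pass the name text through each stage in the order of FIG. 1."""
    pronunciation = to_pronunciation(name_text)
    linguistic = to_linguistic(pronunciation)
    acoustic = to_acoustic(linguistic)
    name_voice = to_name_voice(acoustic)
    return merge(name_voice)


def tag(label: str) -> Callable[[Any], str]:
    """Placeholder stage that only records that it ran."""
    return lambda x: f"{x}>{label}"


result = run_call_me_pipeline(
    "Serena", tag("pron"), tag("ling"), tag("acoustic"), tag("voice"), tag("merged")
)
print(result)
```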

The name text information receiving module 10 is a module that receives name text information (a character string) for a user or a specific person from the user client 100 or a specific service server 200. Name text information means information containing text in a form such as family name plus given name, given name only, family name only, or given name plus family name, for example '김민아', 'Serena', 'Johnson', or 'Daniel James'. On the user client 100 side, the user may enter name text information for the user or a specific person into an input field provided in the interface of a specific website or application, and the entered name text information may be sent over a wired or wireless network to the name text information receiving module 10, either from the user client 100 or from the service server 200 of that website or application. On the specific service server 200 side, in response to a user request on a specific website or application served by the service server 200 (a call-me service request, or a preset action such as landing on a specific page), name text information for the user or a specific person stored in advance in a database connected to the service server 200 may be retrieved and sent from the service server 200 over a wired or wireless network to the name text information receiving module 10. As a result, besides the UX/UI in which the user types a name to request the call-me service, a call-me video to which the name text information of a third-party user account has been applied, at the user's choice, can be used with a UX/UI similar to that of emoticons or gifticons through SMS, messenger applications such as KakaoTalk, Line, and Facebook Messenger, and social media applications such as Facebook, Instagram, Snap, and Snow.

The name pronunciation information generating module 11 is a module that receives name text information from the name text information receiving module 10 and generates name pronunciation information in which the received name text information is converted into a pronunciation string. Name pronunciation information means text information converted into a string spelled as it is pronounced, for example '김미나', 'suruna', 'joanson', or 'daeniul jaims'. FIG. 2 and FIG. 3 show flowcharts in which the name pronunciation information generating module 11 generates name pronunciation information according to an embodiment of the present invention. As shown in FIG. 2 and FIG. 3, generation of name pronunciation information by the name pronunciation information generating module 11 may be carried out by program code comprising a phoneme decomposition step, a pronunciation rule application step, a pronunciation unification step, and a name pronunciation information generation step, and may include a pronunciation unification artificial neural network and a Hangul conversion artificial neural network.

The phoneme decomposition step is a step in which the name pronunciation information generating module 11 decomposes the received name text information into phonemes, the smallest phonological units, to generate phoneme information. A phoneme is the smallest set of speech sounds that speakers of a language perceive as the same, and phonemes may be consonants (consonant phonemes) or vowels (vowel phonemes). When the name text information is in Hangul, the phoneme decomposition step may generate phoneme information by decomposing the name text information into the following consonant and vowel phonemes. In modern Hangul there are 14 consonants that produce a single sound (ㄱ, ㄴ, ㄷ, ㄹ, ㅁ, and so on) and 10 simple vowels (ㅏ, ㅑ, ㅓ, ㅕ, and so on). There are 5 double consonants that produce a compound sound, ㄲ, ㄸ, ㅃ, ㅆ, and ㅉ, and 11 compound vowels, such as ㅐ, ㅒ, ㅔ, and ㅖ. The final consonant (batchim) used when a syllable has a final sound is either a single batchim or a double batchim; every consonant is used as a single batchim, and there are 13 double batchim, such as ㄲ, ㅆ, ㄳ, and ㄵ.
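One way to sketch the phoneme decomposition step for Hangul input is to use the arithmetic layout of the Unicode Hangul Syllables block, in which each precomposed syllable encodes its initial, medial, and final sound. This is an illustrative assumption about how the step could be implemented, not the method of the embodiment; the function name `decompose` is hypothetical.

```python
from typing import List, Tuple

# Jamo in Unicode composition order (19 initials, 21 medials, 27 finals + none).
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

HANGUL_BASE = 0xAC00  # first precomposed syllable, '가'
SYLLABLE_COUNT = len(CHOSEONG) * len(JUNGSEONG) * len(JONGSEONG)


def decompose(text: str) -> List[Tuple[str, str, str]]:
    """Split each Hangul syllable into (initial, medial, final).

    Characters outside the Hangul Syllables block are skipped here.
    """
    result = []
    for ch in text:
        index = ord(ch) - HANGUL_BASE
        if not 0 <= index < SYLLABLE_COUNT:
            continue
        initial = index // (len(JUNGSEONG) * len(JONGSEONG))
        medial = (index % (len(JUNGSEONG) * len(JONGSEONG))) // len(JONGSEONG)
        final = index % len(JONGSEONG)
        result.append((CHOSEONG[initial], JUNGSEONG[medial], JONGSEONG[final]))
    return result


phonemes = decompose("김민아")
print(phonemes)
```

An empty string in the third position marks a syllable with no final sound, which keeps every syllable the same shape for later rule application.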

The pronunciation rule application step is a step in which the name pronunciation information generating module 11 applies pronunciation rules to the phoneme information generated for the name text information in the phoneme decomposition step, converting part of the phoneme information into a pronunciation sequence to generate pronunciation sequence information. In the pronunciation rule application step, the name pronunciation information generating module 11 may convert part of the phoneme information into a pronunciation sequence based on the pronunciation rules of the Standard Pronunciation (Part 2 of the Standard Language Regulations), such as tensification, consonant assimilation, and palatalization.

The pronunciation rules that the name pronunciation information generation module 11 applies to the phoneme information may include: the final-sound rule (only the seven consonants 'ㄱ, ㄴ, ㄷ, ㄹ, ㅁ, ㅂ, ㅇ' are used as final consonants in the pronunciation sequence: 밖 [ㅂㅏㄱ], 꽃 [ㄲㅗㄷ], 무릎 [ㅁㅜㄹㅡㅂ]); the liaison rule (when a final consonant is followed by a phoneme that begins with a vowel, it is moved to the onset of the following syllable: 밥이 [ㅂㅏㅂㅣ], 꽃을 [ㄲㅗㅊㅡㄹ]); the first consonant assimilation rule (the final consonants ㄱ (ㄲ, ㅋ, ㄱㅅ, ㄹㄱ), ㄷ (ㅅ, ㅆ, ㅈ, ㅊ, ㅌ, ㅎ), and ㅂ (ㅍ, ㄹㅂ, ㄹㅍ, ㅂㅅ) are rendered as [ㅇ, ㄴ, ㅁ], respectively, before 'ㄴ, ㅁ': 먹는 [ㅁㅓㅇㄴㅡㄴ], 있는 [ㅇㅣㄴㄴㅡㄴ], 앞마당 [ㅇㅏㅁㅁㅏㄷㅏㅇ]); the second consonant assimilation rule ('ㄹ' following a final 'ㅁ, ㅇ' is rendered as [ㄴ]); the third consonant assimilation rule ('ㄴ' is rendered as [ㄹ] before or after 'ㄹ'); the palatalization rule (when a final 'ㄷ, ㅌ (ㄹㅌ)' combines with the vowel 'ㅣ', it is changed to [ㅈ, ㅊ] and moved to the onset of the following syllable); the aspiration rule (when 'ㅂ, ㄷ, ㅈ, ㄱ' is located before or after 'ㅎ', it is aspirated and rendered as [ㅍ, ㅌ, ㅊ, ㅋ]); the tensification rule ('ㄱ, ㄷ, ㅂ, ㅅ, ㅈ' following a final ㄱ (ㄲ, ㅋ, ㄱㅅ, ㄹㄱ), ㄷ (ㅅ, ㅆ, ㅈ, ㅊ, ㅌ, ㅎ), or ㅂ (ㅍ, ㄹㅂ, ㄹㅍ, ㅂㅅ) is rendered as a tense consonant, and when the first sound of an ending attached after a final 'ㄴ (ㄴㅈ), ㅁ (ㄹㅁ), ㄹ' is 'ㄱ, ㄷ, ㅅ, ㅈ', it is rendered as a tense consonant); the 'ㄴ' insertion rule (when the preceding syllable ends in a consonant and the following syllable is '이, 야, 여, 요, 유', a 'ㄴ' sound is added to give [니, 냐, 녀, 뇨, 뉴], and a 'ㄴ' added after a final 'ㄹ' is rendered as [ㄹ]); and the 'ㅅ' insertion (sai-siot) rule (when a sai-siot precedes a syllable beginning with 'ㄱ, ㄷ, ㅂ, ㅅ, ㅈ', the pronunciation sequence is formed without the sai-siot (깃발 [기빨]); when the sai-siot is followed by 'ㄴ, ㅁ', it is rendered as [ㄴ] (콧물 [콘물]); and when the sai-siot is followed by '이', it is rendered as [ㄴㄴ] (깻잎 [깬닙])).
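As an illustration of how two of these rules operate on jamo, the following is a minimal Python sketch of the final-sound rule and the liaison rule only. It is not the patented implementation: the function names are hypothetical, the decomposition relies on the Unicode Hangul syllable layout, and it assumes input made of Hangul syllables with single-consonant finals (cluster finals and the special behaviour of a final 'ㅎ' are left out).

```python
# Unicode Hangul syllables: code = 0xAC00 + (onset * 21 + vowel) * 28 + coda
ONSETS = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
VOWELS = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
CODAS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

# Final-sound rule for single finals: only ㄱ,ㄴ,ㄷ,ㄹ,ㅁ,ㅂ,ㅇ may remain.
NEUTRAL = {"ㄲ": "ㄱ", "ㅋ": "ㄱ", "ㅅ": "ㄷ", "ㅆ": "ㄷ", "ㅈ": "ㄷ",
           "ㅊ": "ㄷ", "ㅌ": "ㄷ", "ㅎ": "ㄷ", "ㅍ": "ㅂ"}


def split_syllable(ch):
    """Return [onset, vowel, coda] for one precomposed Hangul syllable."""
    i = ord(ch) - 0xAC00
    return [ONSETS[i // 588], VOWELS[(i % 588) // 28], CODAS[i % 28]]


def to_pronunciation(word):
    """Apply liaison first, then the final-sound rule; return a jamo string."""
    syllables = [split_syllable(ch) for ch in word]
    for cur, nxt in zip(syllables, syllables[1:]):
        # Liaison: a final consonant (other than ㅇ) moves into an empty onset.
        if cur[2] not in ("", "ㅇ") and nxt[0] == "ㅇ":
            nxt[0], cur[2] = cur[2], ""
    return "".join(on + vo + NEUTRAL.get(co, co) for on, vo, co in syllables)
```

Liaison is applied before neutralization so that a final such as 'ㅊ' in 꽃을 survives as the onset of the next syllable instead of being reduced to 'ㄷ'.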

In the pronunciation unification step, the name pronunciation information generation module 11 generates corrected pronunciation sequence information by unifying overlapping (similar) pronunciation sequences in the pronunciation sequence information generated in the pronunciation rule application step. For example, pronunciation sequences such as 'ㅞ', 'ㅚ', and 'ㅙ' are unified into 'ㅙ'. As one example, when the phoneme information 'ㅇ ㅣ ㅁ ㅎ ㅖ ㅇ ㅕ ㄴ' is input, 'ㅁㅎ' is unified into 'ㅁ' and 'ㅖ' is unified into the similar pronunciation 'ㅐ', generating the corrected pronunciation sequence information 'ㅇㅣㅁㅐㅇㅕㄴ'. The pronunciation unification step according to an embodiment of the present invention improves performance, such as the accuracy of the acoustic characteristic vector generation module, and shortens training time.
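The worked example above can be reproduced with a small lookup-based sketch. The table below contains only the substitutions named in the text; the function name `unify` and the table layout are assumptions made for illustration, and a real table would be derived from the speaker's data rather than hard-coded.

```python
# Vowel substitutions mentioned in the text (illustrative, not exhaustive).
VOWEL_UNIFY = {"ㅞ": "ㅙ", "ㅚ": "ㅙ", "ㅖ": "ㅐ"}


def unify(jamo):
    """Return the corrected pronunciation sequence for a list of jamo."""
    out = []
    for i, j in enumerate(jamo):
        # 'ㅁㅎ' is unified into 'ㅁ': drop a 'ㅎ' that directly follows 'ㅁ'.
        if j == "ㅎ" and i > 0 and jamo[i - 1] == "ㅁ":
            continue
        out.append(VOWEL_UNIFY.get(j, j))
    return out
```

For the input `"ㅇ ㅣ ㅁ ㅎ ㅖ ㅇ ㅕ ㄴ".split()`, the function drops the 'ㅎ' and maps 'ㅖ' to 'ㅐ'.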

The name pronunciation information generation module 11 according to an embodiment of the present invention may be configured to generate the corrected pronunciation sequence information using similar pronunciation sequences derived from the pronunciation distance between phonemes. The pronunciation distance between phonemes may mean the distance between the acoustic characteristic vectors generated by inputting each phoneme into the acoustic characteristic vector generation module 13 according to an embodiment of the present invention. As a result, similar pronunciation sequences can be generated in a speaker-specific manner, according to the voice characteristics of the speaker whose data was used to train the acoustic characteristic vector generation module 13.
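A minimal sketch of this idea follows, assuming Euclidean distance and a fixed threshold, neither of which is specified in the text. The two-dimensional vectors are made-up stand-ins for the output of the acoustic characteristic vector generation module 13, and `similar_pairs` is a hypothetical helper name.

```python
import math


def distance(u, v):
    """Euclidean distance between two acoustic vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def similar_pairs(vectors, threshold):
    """List phoneme pairs whose acoustic vectors are closer than threshold."""
    names = sorted(vectors)
    return [(p, q)
            for i, p in enumerate(names)
            for q in names[i + 1:]
            if distance(vectors[p], vectors[q]) < threshold]


# Toy vectors, invented for illustration only.
toy = {"ㅙ": [0.90, 0.10], "ㅞ": [0.88, 0.12], "ㅚ": [0.91, 0.08], "ㅏ": [0.10, 0.90]}
pairs = similar_pairs(toy, threshold=0.1)
```

With these toy values, 'ㅙ', 'ㅞ', and 'ㅚ' end up pairwise similar while 'ㅏ' is not paired with any of them. Because the vectors would come from a model trained on one speaker, the resulting pairs would differ from speaker to speaker.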

The name pronunciation information generation module 11 according to another embodiment of the present invention may include a pronunciation unification artificial neural network, which is a pre-trained sequence-to-sequence artificial neural network module that takes pronunciation sequence information as input data and produces corrected pronunciation sequence information as output data. FIG. 4 is a schematic diagram of the pronunciation unification artificial neural network of the name pronunciation information generation module according to another embodiment of the present invention. In FIG. 4, <go> is the signal marking the start of the corrected pronunciation sequence information, and <eos> is the signal marking its end. As shown in FIG. 3, the pronunciation unification artificial neural network according to an embodiment of the present invention may be configured as a recurrent neural network (RNN) including gated recurrent unit (GRU) cells or long short-term memory (LSTM) cells, with pronunciation sequence information as input data and corrected pronunciation sequence information as output data. In the training stage, the network may be trained by a method such as backpropagation based on the difference between the corrected pronunciation sequence information that the network outputs for the input pronunciation sequence information and the ground-truth corrected pronunciation sequence information generated using similar pronunciation sequences derived from the pronunciation distance between phonemes. In particular, the loss function of the pronunciation unification artificial neural network may be a loss function such as mean squared error (MSE) or cross entropy between the acoustic characteristic vector of the input pronunciation sequence information (generated by inputting the pronunciation sequence information into the acoustic characteristic vector generation module 13) and the acoustic characteristic vector of the output corrected pronunciation sequence information (generated by inputting the corrected pronunciation sequence information into the acoustic characteristic vector generation module 13). As a result, the pronunciation unification artificial neural network is not trained to reduce the distance between the pronunciation sequence information and the corrected pronunciation sequence information as embedded by a module such as Word2Vec, which embeds phonemes or words based on how phonemes or syllables are used; instead, it is trained to reduce the distance between them in the acoustic dimension.

In addition, according to another embodiment of the present invention, the loss function of the pronunciation unification artificial neural network may comprise two terms. The first is a loss function such as mean squared error (MSE) or cross entropy between the acoustic characteristic vector of the input pronunciation sequence information (generated by inputting the pronunciation sequence information into the acoustic characteristic vector generation module 13) and the acoustic characteristic vector of the output corrected pronunciation sequence information (generated by inputting the corrected pronunciation sequence information into the acoustic characteristic vector generation module 13); this is the word loss function. The second is a loss function such as MSE or cross entropy between the acoustic characteristic vector of a first phoneme, which is the phoneme in the pronunciation sequence information that is subject to pronunciation unification, and the acoustic characteristic vector of a second phoneme, which is the phoneme in the corrected pronunciation sequence information into which the first phoneme was changed by unification; this is the phoneme loss function. As a result, the pronunciation unification artificial neural network is trained taking into account both the pronunciation similarity between the pronunciation sequence information and the corrected pronunciation sequence information and the pronunciation similarity of the substituted phoneme.
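The two-term loss can be sketched as follows, using MSE for both terms. How the terms are combined is not stated in the text; the weighted sum and the weight `alpha` are assumptions, and the vectors passed in stand for outputs of the acoustic characteristic vector generation module 13.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def unification_loss(seq_in_vec, seq_out_vec, phon_in_vec, phon_out_vec, alpha=1.0):
    """Word loss plus a weighted phoneme loss (the weighting is an assumption)."""
    word_loss = mse(seq_in_vec, seq_out_vec)        # whole-sequence similarity
    phoneme_loss = mse(phon_in_vec, phon_out_vec)   # replaced-phoneme similarity
    return word_loss + alpha * phoneme_loss
```

The first pair of arguments compares the whole pronunciation sequence before and after unification, and the second pair compares only the first phoneme and the second phoneme that replaced it.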

In the RNN module of the name pronunciation information generation module 11 according to another embodiment of the present invention, the hidden vector of the hidden layer at the current time t may be configured to be updated as a function of the state at the previous time t-1 and the input vector, as in Equation 1 below.

In the above equation, h_t is the hidden vector at time t, W_hh is the weight applied to the hidden vector h_{t-1} of the previous time t-1, W_xh is the weight applied to the input vector x_t at time t, and b_h is a constant.
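Using only the symbols defined above, the update takes the standard recurrent form shown here. The activation function is not named in the text, so the hyperbolic tangent is an assumption chosen because it is the usual choice for a plain RNN cell.

$$h_t = \tanh\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t + b_h\right)$$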

As a result, corrected pronunciation sequence information can be generated for a wide variety of name text information or pronunciation sequence information without presetting every similar-pronunciation rule for the name text information and pronunciation sequence information.

In the name pronunciation information generation step, the name pronunciation information generation module 11 generates name pronunciation information by combining the corrected pronunciation sequence information from the pronunciation unification step. For example, when the corrected pronunciation sequence information is 'ㅇ ㅣ ㅁ ㅐ ㅇ ㅕ ㄴ', the name pronunciation information '이매연' is generated.
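Recombining a flat jamo sequence into syllables can be sketched as below. This is an illustrative helper, not the patented method: it assumes every syllable carries an explicit onset (with 'ㅇ' for an empty one) and at most one final consonant, and it treats a consonant as a final only when it is not immediately followed by a vowel.

```python
ONSETS = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
VOWELS = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
CODAS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def compose(jamo):
    """Combine a list of jamo (onset, vowel, optional coda, ...) into syllables."""
    out, i = [], 0
    while i < len(jamo):
        onset, vowel = jamo[i], jamo[i + 1]
        i += 2
        coda = ""
        # A consonant is a coda unless it is the onset of the next syllable.
        if i < len(jamo) and jamo[i] not in VOWELS and (
                i + 1 == len(jamo) or jamo[i + 1] not in VOWELS):
            coda = jamo[i]
            i += 1
        code = 0xAC00 + (ONSETS.index(onset) * 21 + VOWELS.index(vowel)) * 28
        out.append(chr(code + CODAS.index(coda)))
    return "".join(out)
```

Calling `compose("ㅇ ㅣ ㅁ ㅐ ㅇ ㅕ ㄴ".split())` yields the three syllables of the example in the text.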

The name pronunciation information generation module 11 according to an embodiment of the present invention may be configured to generate name pronunciation information in the following steps, based on pre-stored pronunciation rules.

① [Phoneme-unit decomposition step] Decompose the received name text information into jamo, the phoneme units. For example, '김민아' -> 'ㄱ ㅣ ㅁ ....'

② [Pronunciation rule application step] Convert to a pronunciation sequence based on the pre-stored pronunciation rules. For example, this is based on the pronunciation rules of the Standard Pronunciation in Part 2 of the Standard Language Regulations, such as the tensification rule ('이국비 [이국삐]', '김국증 [김국쯩]'), consonant assimilation, and palatalization.

③ [Pronunciation unification step] Unify pronunciations that do not differ significantly when spoken. For example, the 'ㅞ/ㅚ' pronunciations are unified into the 'ㅙ' pronunciation.

④ [Name pronunciation information generation step] Generate name pronunciation information by recombining the text information that was decomposed into jamo.
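The four steps above can be sketched end to end in Python. This is a simplified illustration under stated assumptions: only the first tensification rule is implemented in step ②, only the 'ㅞ/ㅚ' unification is implemented in step ③, a final 'ㅎ' is left out of the trigger set on the assumption that the aspiration rule handles it, and all function names are hypothetical.

```python
ONSETS = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
VOWELS = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
CODAS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

TENSE = {"ㄱ": "ㄲ", "ㄷ": "ㄸ", "ㅂ": "ㅃ", "ㅅ": "ㅆ", "ㅈ": "ㅉ"}
# Finals of the ㄱ, ㄷ, ㅂ groups that trigger tensification (ㅎ omitted).
OBSTRUENT_CODAS = set("ㄱㄲㅋㄳㄺㄷㅅㅆㅈㅊㅌㅂㅍㄼㄿㅄ")
VOWEL_UNIFY = {"ㅞ": "ㅙ", "ㅚ": "ㅙ"}


def decompose(name):
    """Step 1: split each Hangul syllable into [onset, vowel, coda]."""
    result = []
    for ch in name:
        i = ord(ch) - 0xAC00
        result.append([ONSETS[i // 588], VOWELS[(i % 588) // 28], CODAS[i % 28]])
    return result


def apply_rules(syllables):
    """Step 2 (one rule only): tensify an onset after an obstruent final."""
    for prev, cur in zip(syllables, syllables[1:]):
        if prev[2] in OBSTRUENT_CODAS and cur[0] in TENSE:
            cur[0] = TENSE[cur[0]]
    return syllables


def unify(syllables):
    """Step 3: map similar-sounding vowels onto one representative."""
    for syl in syllables:
        syl[1] = VOWEL_UNIFY.get(syl[1], syl[1])
    return syllables


def compose(syllables):
    """Step 4: recombine jamo triples into syllable characters."""
    return "".join(
        chr(0xAC00 + (ONSETS.index(o) * 21 + VOWELS.index(v)) * 28 + CODAS.index(c))
        for o, v, c in syllables)


def name_pronunciation(name):
    return compose(unify(apply_rules(decompose(name))))
```

Applied to the two names used as tensification examples in step ②, the sketch returns the bracketed pronunciations given there.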

As a result, the computation of the artificial neural networks included in the language characteristic vector generation module 12, the acoustic characteristic vector generation module 13, and the name voice information generation module 14 is optimized, so that name voice information generation performance improves, training time decreases, and the amount of training data needed to reach a given level of name voice information generation performance decreases.

The name pronunciation information generation module 11 according to another embodiment of the present invention may be configured to include a pre-trained artificial neural network module with an RNN structure (such as LSTM cells or GRU cells) that takes name text information as input data and produces name pronunciation information as output data.

When the name text information is in a foreign language such as English, the name pronunciation information generation module 11 may be configured to additionally perform a Hangul conversion step, before the phoneme-unit decomposition step, that converts the foreign-language name text information into Hangul name text information. In the Hangul conversion step, the foreign-language name text information is converted into a Hangul pronunciation sequence according to the Loanword Orthography (외래어 표기법) to generate Hangul name text information, and the steps from the phoneme-unit decomposition step onward are performed on the generated Hangul name text information. As a result, the name pronunciation information generation module 11 need not be configured separately for each foreign language. Because no separate name pronunciation information generation module 11 is configured for a particular foreign language, the increase in module complexity and the loss of performance that the differing pronunciation rules of each language would otherwise cause are avoided.

In particular, for a language with low phoneticity (weak correspondence between spelling and sound), such as English, a Hangul conversion artificial neural network may be included in the name pronunciation information generation module 11 and used. The Hangul conversion artificial neural network may be configured as a recurrent neural network (RNN) including gated recurrent unit (GRU) cells or long short-term memory (LSTM) cells, with foreign-language name text information as input data and Hangul name text information as output data. In the training stage, the network may be trained by a method such as backpropagation based on the difference between the Hangul name text information that the network outputs for the input foreign-language name text information and the ground-truth Hangul name text information generated according to the Loanword Orthography. As a result, foreign-language name text information can be converted into Hangul name text information even for a language with low phoneticity such as English. For example, despite the low phoneticity of English, in which 'ough' has many pronunciations, as in 'though [도]', 'through [스루]', 'rough [러프]', 'cough [코프]', 'thought [소트]', and 'bough [바우]', English can be rendered as name pronunciation information composed of a Hangul pronunciation sequence. In addition, an English pronunciation sequence can be output without building complex pronunciation rules into the name pronunciation information generation module 11: for English vowel phonemes, the vowel nasalization, vowel raising, vowel lowering, vowel devoicing, and schwa coalescence rules; and for English consonant phonemes, the aspiration, velarization, syllabic consonant, unreleasing, dentalization, flapping, tensification, labialization, devoicing, voicing, and labiodentalization rules.

The language characteristic vector generation module 12 is an encoding module that receives the name pronunciation information generated by the name pronunciation information generation module 11 and, based on it, generates a language characteristic vector for the text of the name pronunciation information. FIG. 5 is a schematic diagram of language characteristic vector generation by the language characteristic vector generation module 12. As shown in FIG. 5, the language characteristic vector generation module 12 according to an embodiment of the present invention may be configured to generate the language characteristic vector in the following steps.

① Decompose the name pronunciation information into jamo and embed each jamo to generate jamo vectors, which are preset character embedding data for each jamo.

② Input the jamo vectors from step ① into a pre-trained artificial neural network module (the embedding artificial neural network) that takes jamo vectors as input data and produces, as output data, language characteristic vectors: text embedding data for each jamo, meaning the vectors that best represent the name pronunciation information.

③ Output the language characteristic vector, which is the text embedding data for the name pronunciation information.

According to an embodiment of the present invention, the pre-trained artificial neural network module of step ② (the embedding artificial neural network) may be configured by connecting a Pre-net made up of n FC layers in the structure FC (fully connected layer)-ReLU-Dropout-FC-ReLU-Dropout, an n-D convolution layer, and a bidirectional RNN module (which combines the forward sequence and the backward sequence for output), so that the language characteristic vector, which is the text embedding data for the name pronunciation information, is output.

Here, the Pre-net is a two-layer fully connected network with dropout applied, which prevents overfitting and helps training converge.

In particular, the n-D convolution layer may be configured to convolve the name pronunciation information with filters of lengths from unigram to K-gram using a 1D convolution bank and to stack the results. Max pooling is then applied to improve local invariance, with stride = 1 to preserve the resolution of the time axis. To extract higher-level features, several layers of 1D convolution are then computed and a residual connection is applied, so that the sum of the 1D convolution result and the name pronunciation information passes through a highway network and is input to the bidirectional RNN module. Here, the highway network may be configured to take a weighted sum of the output of one neural network layer and the original input, the name pronunciation information.

The acoustic characteristic vector generation module 13 is a decoding module that receives the language characteristic vector from the language characteristic vector generation module 12 and, based on it, generates an acoustic characteristic vector relating to acoustic characteristics (time, frequency, amplitude) such as spectrogram information. More specifically, the acoustic characteristic vector generation module 13 may be configured to receive the acoustic characteristic vector of the frame at a given time step as input and to output the acoustic characteristic vector of the frame at the next time step, based on the concatenation of the context vector, which the context module generates from the language characteristic vector, and the hidden state vector of the context artificial neural network module. In particular, the acoustic characteristic vector generation module 13 according to an embodiment of the present invention can reduce training time, synthesis time, and model size by outputting acoustic characteristic vectors for several frames per time step rather than one.
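The saving from emitting several frames per time step is simple arithmetic, sketched below. The parameter name `frames_per_step` and the example numbers are assumptions; the text does not state how many frames are emitted per step.

```python
import math


def decoder_steps(num_frames, frames_per_step):
    """Number of decoder time steps needed to emit num_frames frames."""
    return math.ceil(num_frames / frames_per_step)


def group_frames(frames, frames_per_step):
    """Group a frame sequence into the chunks produced at each time step."""
    return [frames[i:i + frames_per_step]
            for i in range(0, len(frames), frames_per_step)]
```

For an utterance of 200 frames, one frame per step needs 200 recurrent steps, while three frames per step needs 67, which shortens the sequence the recurrent decoder has to unroll during both training and synthesis.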

In the training session, the acoustic characteristic vector generation module 13 may be trained using, as input information, the language characteristic vector for the voice information or video information of the specific speaker to be merged in the name voice information merging module 15 and, as output information and ground truth, the voice information of that specific speaker.

FIG. 6 is a schematic diagram of acoustic characteristic vector generation by the acoustic characteristic vector generation module 13 according to an embodiment of the present invention. As shown in FIG. 6, the acoustic characteristic vector generation module 13 according to an embodiment of the present invention may include a context module, a context artificial neural network module (RNN), a decoder artificial neural network module (RNN), and a mel spectrogram artificial neural network module, and may be configured to generate the acoustic characteristic vector in the following steps.

① The context module receives the language characteristic vector, the hidden state vector of the context artificial neural network module (whose input vector is at least part of the spectrogram information of the acoustic characteristic vector from the previous time step), and the output vector of the context artificial neural network module, and outputs a context vector.

② The decoder artificial neural network module receives, as a sequence, input vectors formed by concatenating the context vector and the language characteristic vector, and outputs acoustic characteristic vectors as a sequence. In particular, the decoder artificial neural network module is configured as a recurrent neural network (RNN): once the sequence-form input vector has been input, at least part of the spectrogram information of the acoustic characteristic vector from the previous time step (<go> when the previous time step is t = initial) is fed in as the input vector, and the acoustic characteristic vector for the current time step is output.

③ The mel spectrogram artificial neural network module (a convolutional neural network, CNN) receives the acoustic characteristic vector and generates a mel spectrogram, which is the vocoder input information for that acoustic characteristic vector.
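The context vector of step ① can be illustrated with a generic attention computation: score each language characteristic vector against the current decoder-side state, normalize the scores, and take the weighted sum. The dot-product scoring below is an assumption, since the text does not specify how the weights are computed, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(language_vectors, query):
    """Return (context, weights) for one decoder time step.

    language_vectors: one vector per input jamo (encoder output).
    query: hidden state vector of the context network (same dimension).
    """
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in language_vectors]
    weights = softmax(scores)
    dim = len(language_vectors[0])
    context = [sum(w * h[d] for w, h in zip(weights, language_vectors))
               for d in range(dim)]
    return context, weights


context, weights = context_vector([[1.0, 0.0], [0.0, 1.0]], query=[10.0, 0.0])
```

In this toy call the query aligns with the first language characteristic vector, so almost all of the weight falls on it and the context vector is close to that vector.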

Here, for example, the context artificial neural network module according to an embodiment of the present invention may be configured as a single-layer 256-unit GRU, and the decoder artificial neural network module may be configured as a two-layer 256-unit GRU with residual connections for faster convergence of the model.

The context vector according to an embodiment of the present invention means a sequence-form weight vector over part of the sequence of language characteristic vectors input to the decoder artificial neural network module. It determines which part of that sequence mainly drives the acoustic characteristic vector that the decoder artificial neural network module outputs at the current time step, so that acoustic characteristic vectors can be output stably even for name text information that was not seen during training.

The mel spectrogram artificial neural network module according to an embodiment of the present invention is an artificial neural network module configured as an n-layer ConvNet (convolutional neural network, CNN) that takes the acoustic characteristic vector as input data and produces the mel spectrogram of that acoustic characteristic vector (the vocoder input information) as output data.

For convenience of explanation, the present invention is described in terms of the mel spectrogram, a spectrogram on the mel scale, but the scope of the present invention is not limited to it. The vocoder input information according to an embodiment of the present invention is not limited to the mel spectrogram and may include a basic spectrogram that has not passed through a mel filterbank, a spectrum, frequency information obtained with the Fourier transform such as f0 (the fundamental frequency), and aperiodicity, meaning the ratio between the aperiodic components of the signal and the voice signal.

The name voice information generation module 14 is a vocoder module that receives the mel spectrogram, which is the vocoder input information for the acoustic characteristic vector output by the decoder artificial neural network, and generates name voice information (for example, a 1025-dimensional linear-scale spectrogram) by inputting the mel spectrogram of the acoustic characteristic vector (for example, an 80-band mel-scale spectrogram) into a voice generation module such as Griffin-Lim or WaveNet. In other words, the name voice information generation module 14 is a post-processing network for converting the mel-scale acoustic characteristic vector to a linear scale.

The name voice information merging module 15 is a module that generates merged content information by inserting the name voice information generated by the name voice information generation module 14 into the name region of the voice information or video information of a specific speaker in which a name region is preset (the merge target content information), thereby merging the name voice information with that speaker's voice information or video information. FIG. 7 is a schematic diagram showing the merging of name voice information by the name voice information merging module 15 according to an embodiment of the present invention. The name region according to an embodiment of the present invention may consist of a start time and an end time of the region, within the voice information or video information of the specific speaker, into which the name voice information is to be inserted.
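The merge described above can be sketched as a replacement of the samples between the start time and the end time of the name region. This is a minimal illustration on plain sample lists, not the implementation of module 15. The function name `merge_name_audio`, its parameters, and the rule of truncating or zero-padding the name audio to fit the region are assumptions, since the specification does not state how a length mismatch is handled.

```python
def merge_name_audio(content, name_audio, start_sec, end_sec, sample_rate):
    """Replace the preset name region [start_sec, end_sec) of `content` with `name_audio`."""
    start = int(round(start_sec * sample_rate))
    end = int(round(end_sec * sample_rate))
    if not 0 <= start <= end <= len(content):
        raise ValueError("name region lies outside the content")
    region_len = end - start
    # Assumption: fit the name to the region by truncating it or padding it with silence
    fitted = list(name_audio[:region_len])
    fitted += [0.0] * (region_len - len(fitted))
    return content[:start] + fitted + content[end:]

sample_rate = 10            # toy rate so the arithmetic is easy to follow
content = [0.5] * 30        # 3 s of the specific speaker's audio
name_audio = [1.0] * 8      # 0.8 s of synthesized name voice information
merged = merge_name_audio(content, name_audio, 1.0, 2.0, sample_rate)
print(len(merged))          # 30
```

Keeping the total length fixed means the audio stays aligned with any accompanying video. An alternative design would splice the name audio in at its natural length and shift the remaining content, at the cost of that alignment.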

According to an embodiment of the present invention, when a user or a server inputs name text information, customized voice information or video information can be generated in which a specific speaker calls or mentions the input name in that speaker's own voice (the call-me service).

As described above, those skilled in the art to which the present invention pertains will understand that the present invention can be implemented in other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. The scope of the present invention is indicated by the claims set out below rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be interpreted as falling within the scope of the present invention.

The features and advantages described in this specification are not all-inclusive; in particular, many additional features and advantages will become apparent to those skilled in the art in view of the drawings, the specification, and the claims. Moreover, it should be noted that the language used in this specification has been selected primarily for readability and instructional purposes, and may not have been selected to delineate or limit the subject matter of the invention.

The above description of the embodiments of the present invention has been presented for purposes of illustration. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Those skilled in the art will understand that many modifications and variations are possible in light of the above disclosure.

Therefore, the scope of the present invention is limited not by the detailed description but by the claims of any application based on it. Accordingly, the disclosure of the embodiments of the present invention is illustrative and does not limit the scope of the present invention set forth in the following claims.

1: Voice synthesis apparatus for call-me service
10: Name text information receiving module
11: Name pronunciation information generation module
12: Language characteristic vector generation module
13: Acoustic characteristic vector generation module
14: Name voice information generation module
15: Name voice information merging module
100: User client
200: Service server

Claims

A name text information receiving module that receives name text information for a user or a third party from a user client or a specific service server;
A name pronunciation information generation module that receives the name text information from the name text information receiving module and generates name pronunciation information in which the received name text information is converted into a pronunciation string form;
A language characteristic vector generation module that receives the name pronunciation information generated by the name pronunciation information generation module and generates a language characteristic vector for the text of the name pronunciation information based on the name pronunciation information;
An acoustic characteristic vector generation module that receives the language characteristic vector from the language characteristic vector generation module and generates an acoustic characteristic vector based on the language characteristic vector;
A name voice information generation module that receives the acoustic characteristic vector output from the acoustic characteristic vector generation module and generates name voice information using the acoustic characteristic vector; and
A name voice information merging module that inserts the name voice information generated by the name voice information generation module into the name region of merge target content information, which is voice information or video information of a specific speaker in which a name region is preset, thereby merging the name voice information with the merge target content information of the specific speaker and generating merged content information;
The above modules being included, wherein
The name pronunciation information generation module converts the name text information into a pronunciation string form and then generates the name pronunciation information by unifying similar pronunciation strings, which are determined based on a distance between the acoustic characteristic vectors output by the acoustic characteristic vector generation module for each phoneme, and
The merged content information is output so as to provide a call-me service in which the name text information is called with the voice characteristics of the specific speaker,
A voice synthesis apparatus for a call-me service.

The apparatus of claim 1,
Characterized in that the acoustic characteristic vector generation module consists of an artificial neural network that takes as input a context vector generated based on the language characteristic vector and the acoustic characteristic vector of a specific frame, and outputs the acoustic characteristic vector of the frame following the specific frame,
A voice synthesis apparatus for a call-me service.

(Deleted)

The apparatus of claim 1,
When the name text information is written in a foreign language other than Korean,
The name pronunciation information generation module converts the foreign-language name text information into Korean name text information and then converts that name text information into a pronunciation string form to generate the name pronunciation information,
A voice synthesis apparatus for a call-me service.

A name text information receiving step in which a name text information receiving module receives name text information for a user or a third party from a user client or a specific service server;
A name pronunciation information generation step in which a name pronunciation information generation module receives the name text information from the name text information receiving module and generates name pronunciation information in which the received name text information is converted into a pronunciation string form;
A language characteristic vector generation step in which a language characteristic vector generation module receives the name pronunciation information generated by the name pronunciation information generation module and generates a language characteristic vector for the text of the name pronunciation information based on the name pronunciation information;
An acoustic characteristic vector generation step in which an acoustic characteristic vector generation module receives the language characteristic vector from the language characteristic vector generation module and generates an acoustic characteristic vector based on the language characteristic vector;
A name voice information generation step in which a name voice information generation module receives the acoustic characteristic vector output from the acoustic characteristic vector generation module and generates name voice information using the acoustic characteristic vector; and
A name voice information merging step in which a name voice information merging module inserts the name voice information generated by the name voice information generation module into the name region of merge target content information, which is voice information or video information of a specific speaker in which a name region is preset, thereby merging the name voice information with the merge target content information of the specific speaker and generating merged content information;
The above steps being included, wherein
In the name pronunciation information generation step, the name pronunciation information generation module converts the name text information into a pronunciation string form and then generates the name pronunciation information by unifying similar pronunciation strings, which are determined based on a distance between the acoustic characteristic vectors output by the acoustic characteristic vector generation module for each phoneme, and
The merged content information is output so as to provide a call-me service in which the name text information is called with the voice characteristics of the specific speaker,
A voice synthesis method for a call-me service.

The method of claim 5,
Characterized in that, in the acoustic characteristic vector generation step, the acoustic characteristic vector generation module consists of an artificial neural network that takes as input a context vector generated based on the language characteristic vector and the acoustic characteristic vector of a specific frame, and outputs the acoustic characteristic vector of the frame following the specific frame,
A voice synthesis method for a call-me service.

(Deleted)

The method of claim 5,
When the name text information is written in a foreign language other than Korean,
Characterized in that, in the name pronunciation information generation step, the name pronunciation information generation module converts the foreign-language name text information into Korean name text information and then converts that name text information into a pronunciation string form to generate the name pronunciation information,
A voice synthesis method for a call-me service.