KR20200084414A - Method and system for generating voice montage

Info

Publication number: KR20200084414A
Application number: KR1020180167980A
Authority: KR
Inventors: 김남수; 이준엽; 천성준; 최병진
Original assignee: 서울대학교산학협력단
Priority date: 2018-12-21
Filing date: 2018-12-21
Publication date: 2020-07-13
Also published as: KR102159988B1

Abstract

The present invention relates to a method for generating a voice montage and, more specifically, to a method for generating a voice montage using a multi-speaker voice synthesizer. The method comprises the steps of: (1) inputting a sentence; (2) setting a feature parameter for the sentence inputted in the step (1); (3) generating a voice montage using the feature parameters set in the step (2) and the multi-speaker voice synthesizer; and (4) outputting the voice montage generated in the step (3). According to the method for generating the voice montage proposed in the present invention, by using the multi-speaker voice synthesizer trained with a Hidden Markov Model (HMM) or Deep Learning, a voice for each speaker may be generated.

Description

METHOD AND SYSTEM FOR GENERATING VOICE MONTAGE

The present invention relates to a method and system for generating a voice montage and, more particularly, to a method and system for generating a voice montage using a multi-speaker speech synthesizer.

Montage is a term derived from the French monter (to collect, to assemble) and refers to cutting and joining visual media such as video and photographs to create new video, images, pictures, and the like. In criminal investigation, the concept is also applied when police reconstruct and draw the appearance of a suspect who has fled, relying solely on a description from the victim's memory.

A voice montage means creating a new voice similar to the voice of a specific person by mixing the voices and speech-signal features of several speakers, using an existing speech synthesizer in which multiple speakers are registered.

Speech synthesis refers to the technique of producing a human voice from a given text. The conventional unit-concatenation technique stores short speech units taken from a collected speech database and then joins the units corresponding to the text of the sentence to be uttered to produce synthesized speech. Unit concatenation has the advantage of good sound quality, but it has difficulty handling units that do not exist in the collected speech database, the junctions between units sound unnatural, and only the voices of speakers registered in the speech database can be used.

Accordingly, there is a need to develop a method and system for generating a voice montage that can use the voice of a speaker who does not exist in the database.

Meanwhile, as prior art related to the present invention, Korean Registered Patent No. 10-1420557 (title of invention: Parametric speech synthesis method and system) and others have been disclosed.

The present invention has been proposed to solve the above problems of previously proposed methods. An object of the invention is to provide a method and system for generating a voice montage that can synthesize and output a voice similar to that of the suspect being sought, by setting different feature parameters for each speaker on the basis of a multi-speaker speech synthesizer.

Another object of the present invention is to provide a method and system for generating a voice montage that can train the multi-speaker speech synthesizer quickly and increase the accuracy of the output voice montage, by training the multi-speaker speech synthesizer using a Hidden Markov Model (HMM) or deep learning.

Yet another object of the present invention is to provide a method and system for generating a voice montage that, by using a multi-speaker speech synthesizer trained with a Hidden Markov Model (HMM) or deep learning, can not only produce the voice of each speaker but also effectively synthesize and output the timbre of a target speaker by mixing two or more timbres.

To achieve the above objects, a method for generating a voice montage according to a feature of the present invention comprises:

(1) inputting a sentence;

(2) setting feature parameters for the sentence input in step (1);

(3) generating a voice montage using the feature parameters set in step (2) and a multi-speaker speech synthesizer; and

(4) outputting the voice montage generated in step (3).

Preferably, the feature parameters in step (2) may be a speaker, an emotion, and a voice style.

More preferably, the voice style may be the pitch, speed, volume, and pronunciation of the voice.

More preferably, step (2) may comprise:

(2-1) setting a speaker for the sentence input in step (1);

(2-2) setting an emotion for the sentence for which the speaker was set in step (2-1); and

(2-3) setting a voice style for the sentence for which the emotion was set in step (2-2).

Even more preferably, step (3) may comprise:

(3-1) training a multi-speaker speech synthesizer using a Hidden Markov Model (HMM) or deep learning; and

(3-2) generating a voice montage using the feature parameters set through steps (2-1) to (2-3) and the multi-speaker speech synthesizer trained in step (3-1).

To achieve the above objects, a system for generating a voice montage according to a feature of the present invention comprises:

an input unit for inputting a sentence;

a parameter setting unit for setting feature parameters for the sentence input through the input unit;

a voice montage generation unit for generating a voice montage using the feature parameters set by the parameter setting unit and a multi-speaker speech synthesizer; and

an output unit for outputting the voice montage generated by the voice montage generation unit.

Preferably, the feature parameters may be a speaker, an emotion, and a voice style.

More preferably, the parameter setting unit may comprise:

a speaker setting module for setting a speaker for the sentence input through the input unit;

an emotion setting module for setting an emotion for the sentence for which the speaker was set by the speaker setting module; and

a voice style setting module for setting a voice style for the sentence for which the emotion was set by the emotion setting module.

Even more preferably, the voice montage generation unit may comprise:

a training module for training a multi-speaker speech synthesizer using a Hidden Markov Model (HMM) or deep learning; and

a voice montage generation module for generating a voice montage using the feature parameters set through the speaker setting module, the emotion setting module, and the voice style setting module, together with the multi-speaker speech synthesizer trained by the training module.

According to the method and system for generating a voice montage proposed in the present invention, a voice similar to that of the suspect being sought can be synthesized and output by setting different feature parameters for each speaker on the basis of a multi-speaker speech synthesizer.

In addition, according to the proposed method and system, training the multi-speaker speech synthesizer using a Hidden Markov Model (HMM) or deep learning allows the synthesizer to be trained quickly and increases the accuracy of the output voice montage.

Furthermore, according to the proposed method and system, using a multi-speaker speech synthesizer trained with a Hidden Markov Model (HMM) or deep learning makes it possible not only to produce the voice of each speaker but also to effectively synthesize and output the timbre of a target speaker by mixing two or more timbres.

FIG. 1 is a flowchart of a method for generating a voice montage according to an embodiment of the present invention.
FIG. 2 shows the detailed flow of step S200 in the method for generating a voice montage according to an embodiment of the present invention.
FIG. 3 shows the detailed flow of step S300 in the method for generating a voice montage according to an embodiment of the present invention.
FIG. 4 illustrates a Hidden Markov Model (HMM).
FIG. 5 illustrates the Multi-Layer Perceptron (MLP) model, one of the artificial neural network models.
FIG. 6 illustrates the Recurrent Neural Network (RNN) model, one of the deep learning models.
FIG. 7 illustrates the Long Short Term Memory (LSTM) model, one of the deep learning models.
FIG. 8 illustrates the Convolutional Neural Network (CNN) model, one of the deep learning models.
FIG. 9 shows the configuration of a system for generating a voice montage according to an embodiment of the present invention.
FIG. 10 shows the detailed configuration of the parameter setting unit in the system for generating a voice montage according to an embodiment of the present invention.
FIG. 11 shows the detailed configuration of the voice montage generation unit in the system for generating a voice montage according to an embodiment of the present invention.

Hereinafter, preferred embodiments are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the present invention pertains can readily practice the invention. In describing the preferred embodiments, detailed descriptions of related known functions or configurations are omitted where they could unnecessarily obscure the gist of the invention. The same or similar reference numerals are used throughout the drawings for parts having similar functions and operations.

In addition, throughout the specification, when a part is said to be 'connected' to another part, this includes not only the case where it is 'directly connected' but also the case where it is 'indirectly connected' with another element in between. Furthermore, 'including' a component means that other components may be further included, not excluded, unless specifically stated otherwise.

Each step of the method for generating a voice montage according to an embodiment of the present invention may be performed by a computer device. For convenience of description, the entity performing each step may be omitted below.

FIG. 1 is a flowchart of a method for generating a voice montage according to an embodiment of the present invention. As shown in FIG. 1, the method may be implemented to include inputting a sentence (S100), setting feature parameters for the sentence input in step S100 (S200), generating a voice montage using the feature parameters set in step S200 and a multi-speaker speech synthesizer (S300), and outputting the voice montage generated in step S300 (S400).
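As an illustration only, the four steps S100 to S400 can be sketched as a small pipeline. Everything named here (`FeatureParams`, `run_pipeline`, `dummy_synthesizer`, the speaker and emotion labels, and the default values) is a hypothetical placeholder, not part of the disclosure; the synthesizer is a stand-in that returns a description rather than audio.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureParams:
    # Hypothetical container for the feature parameters set in S200.
    speaker_weights: dict = field(default_factory=dict)
    emotion_weights: dict = field(default_factory=dict)
    style: dict = field(default_factory=dict)  # e.g. pitch, speed, volume


def input_sentence(text):
    # S100: accept the sentence to be spoken.
    return text.strip()


def set_feature_params(sentence):
    # S200: placeholder values; in practice these come from the user.
    return FeatureParams(
        speaker_weights={"speaker_a": 0.5, "speaker_b": 0.5},
        emotion_weights={"anger": 1.0},
        style={"pitch": 1.0, "speed": 1.0, "volume": 1.0},
    )


def dummy_synthesizer(sentence, params):
    # Stand-in for a trained multi-speaker synthesizer (assumption).
    return {"text": sentence, "params": params}


def generate_montage(sentence, params, synthesizer):
    # S300: combine the sentence, the parameters, and the synthesizer.
    return synthesizer(sentence, params)


def output_montage(montage):
    # S400: hand the result back to the caller.
    return montage


def run_pipeline(text, synthesizer=dummy_synthesizer):
    sentence = input_sentence(text)
    params = set_feature_params(sentence)
    montage = generate_montage(sentence, params, synthesizer)
    return output_montage(montage)
```

Passing a different callable as `synthesizer` swaps in another model without touching the other steps.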

Each step of the method for generating a voice montage according to an embodiment of the present invention is described in detail below.

In step S100, a sentence may be input. More specifically, the sentence to be spoken in the voice montage may be input. To generate a voice montage that resembles the user's memory, a sentence from the remembered situation may be input and used as the speech-synthesis sample.

In step S200, feature parameters may be set for the sentence input in step S100. FIG. 2 shows the detailed flow of step S200 in the method for generating a voice montage according to an embodiment of the present invention. As shown in FIG. 2, step S200 may be implemented to include setting a speaker for the sentence input in step S100 (S210), setting an emotion for the sentence for which the speaker was set in step S210 (S220), and setting a voice style for the sentence for which the emotion was set in step S220 (S230).

The feature parameters in step S200 may be a speaker, an emotion, and a voice style; more specifically, they may be the speaker, the emotion, and the pitch, speed, volume, and pronunciation of the voice. However, the feature parameters are not limited to these.

In step S210, a speaker may be set for the sentence input in step S100. More specifically, step S210 focuses on timbre: the voice features of the set speakers are reflected as an average, and weights are used to set the speaker for the sentence input in step S100. For example, the speaker may be set by gender, age group, and the like.

More specifically, in step S210, a synthesized voice generated by averaging the voice features of the set speakers is played to the user. The user is then asked which of the selected speakers the synthesized voice should be closer to, and by how much, and the speaker-selection weights are determined from the answer. A synthesized voice regenerated with these weights is played to the user again, and the process is repeated until the weights are judged to be correct, so that a voice close to the one the user wants can be generated.
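The feedback loop in step S210 can be sketched as follows. This is a minimal sketch under stated assumptions: each speaker is represented by a hypothetical feature vector, mixing is a plain weighted average, and the listener's answer is reduced to naming one speaker (or `None` when satisfied). The function names, the fixed `step` size, and the renormalization rule are illustrative choices, not the method of the disclosure.

```python
def mix_embeddings(embeddings, weights):
    """Weighted average of per-speaker vectors; weights are normalized to sum to 1."""
    total = sum(weights.values())
    dim = len(next(iter(embeddings.values())))
    mixed = [0.0] * dim
    for name, w in weights.items():
        for i, value in enumerate(embeddings[name]):
            mixed[i] += (w / total) * value
    return mixed


def refine_weights(weights, closer_to, step=0.2):
    """Shift weight toward the speaker the listener named, then renormalize."""
    updated = dict(weights)
    updated[closer_to] += step
    total = sum(updated.values())
    return {name: w / total for name, w in updated.items()}


def montage_loop(embeddings, ask_listener, max_rounds=10):
    # Start from the plain average of the selected speakers.
    weights = {name: 1.0 / len(embeddings) for name in embeddings}
    for _ in range(max_rounds):
        mixed = mix_embeddings(embeddings, weights)
        # ask_listener returns a speaker name, or None when the listener is satisfied.
        answer = ask_listener(mixed)
        if answer is None:
            break
        weights = refine_weights(weights, answer)
    return weights, mix_embeddings(embeddings, weights)
```

In a real system `ask_listener` would play the synthesized voice and collect the user's response; here it is any callable that takes the mixed vector.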

In step S220, an emotion may be set for the sentence for which the speaker was set in step S210. More specifically, the emotion of the situation remembered by the voice montage user is set, which guides the final output voice montage to resemble the voice of the target suspect. For example, emotions such as anger, sadness, and joy may be set, and several emotions may also be mixed when setting the emotion for the sentence.

In step S230, a voice style may be set for the sentence for which the emotion was set in step S220. More specifically, for the sentence for which the speaker and emotion were set in steps S210 and S220, the pitch, speed, volume, and pronunciation of the voice may be set.

In step S300, a voice montage may be generated using the feature parameters set in step S200 and a multi-speaker speech synthesizer. FIG. 3 shows the detailed flow of step S300 in the method for generating a voice montage according to an embodiment of the present invention. As shown in FIG. 3, step S300 may be implemented to include training a multi-speaker speech synthesizer using a Hidden Markov Model (HMM) or deep learning (S310), and generating a voice montage using the feature parameters set through steps S210 to S230 and the multi-speaker speech synthesizer trained in step S310 (S320).

In step S310, a multi-speaker speech synthesizer may be trained using a Hidden Markov Model (HMM) or deep learning.

The Hidden Markov Model (HMM) and deep learning used in the method for generating a voice montage according to an embodiment of the present invention are described below.

The Hidden Markov Model (HMM) is a statistical Markov model that regards a system as consisting of two elements: hidden states and observable outputs. The word 'hidden' is used because the direct causes of the observable outputs are hidden states that cannot be observed; only the outputs those states produce through a Markov process can be observed.

FIG. 4 illustrates a Hidden Markov Model (HMM). In FIG. 4, x denotes the states, y the possible observations, a the state transition probabilities, and b the output probabilities. As shown in FIG. 4, the observer can observe only y1, y2, y3, and y4 drawn from each state. Even if the observer knows the proportions of the balls inside and has observed y1, y2, and y3, the observer still cannot know the internal state and can only compute quantities such as the likelihood.

An HMM is trained by finding the transition probabilities and output probabilities that maximize the probability of the observed outputs. This is generally done by maximum-likelihood estimation based on the given observations.
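The likelihood that training seeks to maximize can be computed with the standard forward algorithm. The description does not name this algorithm, so the sketch below is offered only as a common way to evaluate the probability of an observation sequence given transition probabilities a and output probabilities b; the function name and the numbers in the usage example are illustrative.

```python
def forward_likelihood(initial, transition, emission, observations):
    """Return P(observations) under an HMM via the forward algorithm.

    initial[i]        probability of starting in state i
    transition[i][j]  probability a_ij of moving from state i to state j
    emission[i][k]    probability b_i(k) of state i emitting symbol k
    """
    n_states = len(initial)
    # alpha[i] = P(observations so far, current state = i)
    alpha = [initial[i] * emission[i][observations[0]] for i in range(n_states)]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[i] * transition[i][j] for i in range(n_states)) * emission[j][symbol]
            for j in range(n_states)
        ]
    return sum(alpha)
```

For example, with `initial=[0.6, 0.4]`, `transition=[[0.7, 0.3], [0.4, 0.6]]`, `emission=[[0.9, 0.1], [0.2, 0.8]]`, and observations `[0, 1]`, the function returns 0.209 (up to floating-point rounding).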

An Artificial Neural Network (ANN) is a statistical learning algorithm used in machine learning and cognitive science, inspired by biological neural networks (the central nervous system of animals, in particular the brain). The term refers to models in general in which artificial neurons (nodes), linked into a network by synapses, acquire problem-solving ability by changing the strength of those synaptic connections through learning. In a narrow sense the term is sometimes used for a multilayer perceptron trained with error backpropagation, but this is a misuse; artificial neural networks are not limited to that.

Deep learning is defined as a set of machine learning algorithms that attempt high-level abstraction through a combination of several nonlinear transformations. Broadly, it is a field of machine learning that teaches computers the human way of thinking.

FIG. 5 illustrates the Multi-Layer Perceptron (MLP) model, one of the artificial neural network models. As shown in FIG. 5, the MLP model is a neural network with one or more intermediate layers between the input layer and the output layer; these intermediate layers are called hidden layers. The network is connected in the direction from the input layer through the hidden layers to the output layer. It is a feedforward network: there are no connections within a layer and no direct connection from the output layer back to the input layer.

The MLP model has a structure similar to the single-layer perceptron, but by making the input/output characteristics of the intermediate layers and of each unit nonlinear, it improves the capability of the network and overcomes several shortcomings of the single-layer perceptron. As the number of layers increases, the decision regions formed by the perceptron become more sophisticated. More specifically, a single layer divides the pattern space into two regions; two layers form convex open regions or concave closed regions; and three layers can, in theory, form regions of any shape.

In general, when input data is presented to each unit of the input layer, the signal is transformed at each unit, passed to the intermediate layers, and finally emitted from the output layer. The MLP model can be trained by comparing this output with the desired output and adjusting the connection strengths in the direction that reduces the difference.
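The training idea described above (compare the output with the desired output, then adjust the connection strengths to reduce the difference) can be sketched for a network with one hidden layer. The sigmoid activation, squared-error loss, learning rate, and function names are assumptions chosen for a minimal example; the description does not fix any of them.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_out):
    """One hidden layer; the last entry of each weight row is a bias term."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hidden]
    output = sigmoid(sum(w * v for w, v in zip(w_out, hidden + [1.0])))
    return hidden, output


def train_step(x, target, w_hidden, w_out, lr=0.5):
    """Adjust weights in place to reduce the squared error for one example.

    Returns the loss measured before the update.
    """
    hidden, output = forward(x, w_hidden, w_out)
    delta_out = (output - target) * output * (1.0 - output)
    # Hidden deltas use the output weights from before this update.
    delta_hidden = [delta_out * w_out[j] * h * (1.0 - h) for j, h in enumerate(hidden)]
    for j, v in enumerate(hidden + [1.0]):
        w_out[j] -= lr * delta_out * v
    for j, row in enumerate(w_hidden):
        for i, v in enumerate(x + [1.0]):
            row[i] -= lr * delta_hidden[j] * v
    return 0.5 * (output - target) ** 2
```

Calling `train_step` repeatedly on the same example drives the output toward the target, which is the "reduce the difference" behaviour in its simplest form.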

FIG. 6 illustrates the Recurrent Neural Network (RNN) model, one of the deep learning models. As shown in FIG. 6, the part labelled A is the hidden state, and the hidden states are connected by directed edges to form a directed cycle. The RNN is a kind of deep learning model known to be well suited to processing data that arrives sequentially, such as speech and text.

Because the RNN model is a network structure that can accept inputs and produce outputs regardless of sequence length, it has the advantage that the structure can be built in varied and flexible ways as needed.

In addition, the RNN model has a recurrent structure that can be viewed as unrolled into several hidden layers. The hidden state at the current time step is updated from the hidden state at the immediately preceding time step, and the hyperbolic tangent, a nonlinear function, may be used as the activation function for the state.
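The hidden-state update can be written as h_t = tanh(W_xh x_t + W_hh h_{t-1} + b). The sketch below implements that common formulation with plain lists; the weight names `w_xh`, `w_hh`, and `bias` are assumptions introduced for illustration.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, bias):
    """h_t = tanh(W_xh x_t + W_hh h_{t-1} + b) for one time step."""
    h_new = []
    for j in range(len(h_prev)):
        total = bias[j]
        total += sum(w_xh[j][i] * x[i] for i in range(len(x)))
        total += sum(w_hh[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(inputs, h0, w_xh, w_hh, bias):
    """Apply the same weights at every time step, carrying the hidden state forward."""
    h = h0
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_xh, w_hh, bias)
        states.append(h)
    return states
```

Because the same weights are reused at every step, `run_rnn` accepts input sequences of any length.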

Furthermore, the RNN model may be trained through the values passed from the input to the hidden layer, the values passed from the previous hidden layer to the next hidden layer, and the values passed from the hidden layer to the output.

However, when the distance between a piece of relevant information and the point where it is used is large, the RNN model is known to suffer from the vanishing gradient problem, in which the gradient shrinks progressively during backpropagation and learning ability degrades sharply. The Long Short Term Memory (LSTM) model was devised to overcome this.
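The shrinking gradient can be seen numerically in a scalar tanh RNN, where the gradient of a late hidden state with respect to an early one is a product of per-step factors w_hh * (1 - tanh(a_t)^2). The sketch below is a toy illustration under that assumption; the weight and pre-activation values in the usage note are arbitrary.

```python
import math


def gradient_through_time(w_hh, pre_activations):
    """Running product of w_hh * (1 - tanh(a_t)^2) for a scalar tanh RNN.

    history[t] is the gradient of the hidden state after step t+1
    with respect to the initial hidden state.
    """
    grad = 1.0
    history = []
    for a in pre_activations:
        grad *= w_hh * (1.0 - math.tanh(a) ** 2)
        history.append(grad)
    return history
```

With `w_hh = 0.5` and twenty steps at pre-activation 0.3, each factor is below 0.5, so the product falls below 1e-5 by the last step.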

FIG. 7 illustrates the Long Short Term Memory (LSTM) model, one of the deep learning models. As shown in FIG. 7, the LSTM model adds a cell state to the hidden state of the conventional RNN model. The added cell state acts as a kind of conveyor belt, so gradients can propagate well to the state even after a long time has passed.

Like the RNN model, the LSTM model has a recurrent structure, but unlike the RNN model, which has a single neural network layer, it has a special structure with four interacting layers.

또한, LSTM 모델은, 마이너한 연산과정을 거치고 전체 체인을 관통하는 cell-state, 정보들이 선택적으로 cell-state로 들어갈 수 있도록 하는 gate, 및 각 구성요소가 얼마만큼의 영향을 주게 될지를 결정하는 sigmoid layer를 포함하여 구성될 수 있다. 이때, sigmoid layer은, 0과 1을 출력하는데, 0이라는 값을 가지게 된다면, 해당 구성요소가 미래의 결과에 아무런 영향을 주지 않도록 만드는 것이고, 반면에, 1이라는 값은 해당 구성요소가 확실히 미래의 예측결과에 영향을 주도록 데이터가 흘러가게 만들 수 있으며, gate는 sigmoid 또는 tanh function으로 구성될 수 있다.
In addition, the LSTM model may include a cell-state that runs through the entire chain while undergoing only minor operations, gates that allow information to selectively enter the cell-state, and a sigmoid layer that determines how much influence each component will have. The sigmoid layer outputs a value between 0 and 1: a value of 0 prevents the corresponding component from having any effect on future results, whereas a value of 1 lets the data flow so that the component fully affects the future prediction result. A gate may be composed of a sigmoid or tanh function.

뿐만 아니라, LSTM 모델은, cell state의 값을 바꾸고 기억하거나 잊어버리는 단계, 어떤 정보를 cell state에 담을 것인지 결정하는 단계, 및 어떤 값을 출력으로 할지 결정하는 단계를 통해 결과값을 출력할 수 있다.
Furthermore, the LSTM model can output a result value through a step of changing, remembering, or forgetting the value of the cell state, a step of deciding what information to store in the cell state, and a step of deciding which value to output.

cell state의 값을 바꾸고 기억하거나 잊어버리는 단계에서는, LSTM 모델은 cell state 값을 잊어버릴지 가져갈지 결정하는 forget gate layer을 가질 수 있는데, forget gate layer은 입력값을 보고 sigmoid function을 통과시켜서 0에서 1 사이의 값을 가지게 하여, cell state 값을 잊어버릴지 가져갈지 결정할 수 있다.
In the step of changing, remembering, or forgetting the value of the cell state, the LSTM model may have a forget gate layer that determines whether to forget or keep the cell state value. The forget gate layer looks at the input value and passes it through a sigmoid function to obtain a value between 0 and 1, which determines whether the cell state value is forgotten or kept.

어떤 정보를 cell state에 담을 것인지 결정하는 단계에서는, input gate layer 라고 불리는 sigmoid layer가 어떤 값을 업데이트 할 지 결정하고, tanh layer가 어떤 후보 값들을 만들어내어, 이렇게 만들어진 두 개의 값을 서로 곱하여, 어떤 정보를 cell state에 담을 것인지 결정할 수 있다.
In the step of deciding what information to store in the cell state, a sigmoid layer called the input gate layer decides which values to update, and a tanh layer creates candidate values. The two values thus produced are multiplied together to determine what information is stored in the cell state.

어떤 값을 출력으로 할지 결정하는 단계에서는, cell state에 tanh를 씌워서 -1에서 1 사이의 값을 만들고, 입력된 값에서 나온 activation 값을 tanh layer에서 나온 값과 곱해서 출력할 수 있다.
In the step of deciding which value to output, tanh is applied to the cell state to produce a value between -1 and 1, and the activation value obtained from the input value is multiplied by the value from the tanh layer to produce the output.
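
The three steps described above can be sketched as a single scalar LSTM update. This is a minimal illustration and not the implementation of the present invention: the names `GateParams` and `lstm_step`, the use of scalar weights instead of matrices, and the choice of a sigmoid for the output activation are assumptions made for readability.

```python
import math
from dataclasses import dataclass


def sigmoid(x: float) -> float:
    """Logistic function; squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class GateParams:
    """Hypothetical scalar parameters of one layer inside the cell."""
    w_x: float  # weight on the current input
    w_h: float  # weight on the previous hidden state
    b: float    # bias

    def pre(self, x: float, h_prev: float) -> float:
        return self.w_x * x + self.w_h * h_prev + self.b


def lstm_step(x, h_prev, c_prev, forget, input_, candidate, output):
    """One time step; returns the new hidden state and cell state."""
    f = sigmoid(forget.pre(x, h_prev))        # forget gate: keep or drop c_prev
    i = sigmoid(input_.pre(x, h_prev))        # input gate: which values to update
    g = math.tanh(candidate.pre(x, h_prev))   # candidate values from the tanh layer
    c = f * c_prev + i * g                    # updated cell state
    o = sigmoid(output.pre(x, h_prev))        # activation computed from the input
    h = o * math.tanh(c)                      # output: activation times tanh(cell state)
    return h, c


keep = GateParams(0.0, 0.0, 100.0)    # sigmoid is approximately 1
drop = GateParams(0.0, 0.0, -100.0)   # sigmoid is approximately 0
zero = GateParams(0.0, 0.0, 0.0)
h, c = lstm_step(1.0, 0.0, 0.7, keep, drop, zero, keep)
print(h, c)
```

With the forget gate near 1 and the input gate near 0, the cell state passes through the step essentially unchanged, which corresponds to the conveyor-belt behavior attributed to the cell-state.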

CNN(Convolutional Neural Network) 모델은, 하나 또는 여러 개의 콘볼루션 계층(convolutional layer)과 통합 계층(pooling layer), 완전하게 연결된 계층(fully connected layer)들로 구성된 신경망 모델이다. CNN 모델은, 2차원 데이터의 학습에 적합한 구조를 가지고 있으며, 역전파 알고리즘(Backpropagation algorithm)을 통해 훈련될 수 있어, 영상 내 객체 분류, 객체 탐지 등 다양한 응용 분야에 폭넓게 활용될 수 있다.
The CNN (Convolutional Neural Network) model is a neural network model composed of one or several convolutional layers, a pooling layer, and fully connected layers. The CNN model has a structure suitable for learning two-dimensional data, and can be trained through a backpropagation algorithm, and thus can be widely used in various application fields such as object classification in an image and object detection.

콘볼루션 계층은, 입력 데이터로부터 특징을 추출하는 역할을 할 수 있다. 콘볼루션 계층은 특징을 추출하는 기능을 하는 필터(filter)와, 필터에서 추출된 값을 비선형 값으로 바꾸어주는 액티베이션 함수(activation function)로 이루어질 수 있다.
The convolution layer can serve to extract features from the input data. The convolution layer may consist of a filter that functions to extract features and an activation function that converts the values extracted from the filter into nonlinear values.

도 8은 딥 러닝(Deep Learning) 모델 중 CNN(Convolutional Neural Network) 모델을 설명하기 위해 도시한 도면이다. 도 8에 도시된 바와 같이, CNN 모델은, 첫 번째로, 3개의 필터 사이즈 2, 3, 4를 각 두 개씩 총 6개를 문장 매트릭스에 합성곱을 수행하고 피쳐 맵을 생성하고, 두 번째로, 각 맵에 대해 맥스 풀링을 진행하여 각 피쳐 맵으로부터 가장 큰 수를 남긴 후, 세 번째로, 이들 6개 맵에서 단변량(univariate) 벡터가 생성되고, 이들 6개 피쳐는 두 번째 레이어를 위한 피쳐 벡터로 연결되는데, 마지막으로 소프트맥스 레이어는 피쳐 값을 받아 문장, 사진, 음성 등을 분류할 수 있다.
FIG. 8 is a diagram illustrating a Convolutional Neural Network (CNN) model among deep learning models. As shown in FIG. 8, the CNN model first convolves the sentence matrix with a total of six filters, two for each of the three filter sizes 2, 3, and 4, to generate feature maps. Second, max pooling is performed on each map, keeping the largest number from each feature map. Third, a univariate vector is generated from these six maps, and these six features are concatenated into a feature vector for the second layer. Finally, the softmax layer receives the feature values and can classify sentences, photos, voices, and the like.
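
The flow of FIG. 8 can be sketched in plain Python as follows. This is an illustrative forward pass only, under several assumptions not stated in the source: random untrained weights, an embedding dimension of 5, a sentence of 7 words, a ReLU activation, and two output classes. The function names are hypothetical.

```python
import math
import random


def convolve(sentence, filt):
    """Slide a (k x d) filter down an (n x d) sentence matrix.

    Returns a feature map of length n - k + 1 after a ReLU activation.
    """
    k = len(filt)
    feature_map = []
    for start in range(len(sentence) - k + 1):
        total = sum(
            w * x
            for filt_row, word in zip(filt, sentence[start:start + k])
            for w, x in zip(filt_row, word)
        )
        feature_map.append(max(0.0, total))  # non-linear activation (ReLU)
    return feature_map


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sentence, filters, class_weights):
    """Convolution, max pooling, concatenation, then a softmax layer."""
    pooled = [max(convolve(sentence, f)) for f in filters]  # one value per map
    scores = [sum(w * p for w, p in zip(row, pooled)) for row in class_weights]
    return pooled, softmax(scores)


rng = random.Random(0)
EMBED_DIM = 5
sentence = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(7)]
# Two filters for each of the sizes 2, 3, and 4: six filters in total.
filters = [
    [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(size)]
    for size in (2, 3, 4)
    for _ in range(2)
]
class_weights = [[rng.uniform(-1, 1) for _ in range(len(filters))] for _ in range(2)]

pooled, probs = classify(sentence, filters, class_weights)
print(len(pooled), probs)
```

Max pooling reduces each feature map to a single number, so the concatenated feature vector has one entry per filter regardless of sentence length.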

CNN 모델은, 경사하강법(gradient descent)와 역전파(backpropagation) 알고리즘을 통해 학습시킬 수 있다. 이때, 경사하강법은 1차 근사값 발견용 최적화 알고리즘으로서, 함수의 기울기(경사)를 구하여 기울기가 낮은 쪽으로 계속 이동시켜서 극값에 이를 때까지 반복시키는 방법이고, 역전파 알고리즘은, 다층 퍼셉트론 학습에 사용되는 통계적 기법을 의미하는 것으로서, 동일 입력층에 대해 원하는 값이 출력되도록 개개의 weight를 조정하는 방법이다.
The CNN model can be trained through gradient descent and the backpropagation algorithm. Here, gradient descent is a first-order optimization algorithm that obtains the gradient (slope) of a function and keeps moving in the direction of decreasing slope, repeating until an extreme value is reached. The backpropagation algorithm refers to a statistical technique used for training a multilayer perceptron, and is a method of adjusting the individual weights so that a desired value is output for the same input layer.
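
As a minimal sketch of the gradient descent procedure described above, the following minimizes a one-dimensional function. The objective f(x) = (x - 3)^2, the learning rate, and the stopping tolerance are assumptions chosen for illustration; training a CNN applies the same update to every weight, with the gradients supplied by backpropagation.

```python
def gradient_descent(grad, start, learning_rate=0.1, tolerance=1e-8,
                     max_steps=10_000):
    """Repeatedly step against the gradient until the step becomes tiny."""
    x = start
    for _ in range(max_steps):
        step = learning_rate * grad(x)
        if abs(step) < tolerance:  # close enough to an extreme value
            break
        x -= step                  # move toward lower function values
    return x


# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```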

본 발명의 일실시예에 따른 음성 몽타주 생성 방법의 단계 S310에서는, 전술한 바와 같은, 은닉 마르코프 모델(Hidden Markov Model, HMM) 또는 딥 러닝(deep learning)을 이용하여 다화자 음성 합성기를 학습시킬 수 있어, 빠르게 다화자 음성 합성기를 학습시키고, 출력되는 음성 몽타주의 정확도를 높일 수 있다.
In step S310 of the method for generating a voice montage according to an embodiment of the present invention, the multi-speaker speech synthesizer can be trained using a Hidden Markov Model (HMM) or deep learning as described above, so that the multi-speaker speech synthesizer can be trained quickly and the accuracy of the output voice montage can be increased.

단계 S320에서는, 단계 S210 내지 단계 S230을 통해 설정된 특징 파라미터와 단계 S310에서 학습된 다화자 음성 합성기를 이용하여 음성 몽타주를 생성할 수 있다. 보다 구체적으로는, 본 발명의 일실시예에 따른 음성 몽타주 생성 방법의 단계 S320에서는, 단계 S210 내지 단계 S230을 통해 설정된 화자, 감정 및 음성 스타일과 은닉 마르코프 모델(Hidden Markov Model, HMM) 또는 딥 러닝(deep learning)으로 학습된 다화자 음성 합성기를 이용하여 음성 몽타주를 생성할 수 있다.
In step S320, a voice montage may be generated using the feature parameters set in steps S210 to S230 and the multi-speaker speech synthesizer trained in step S310. More specifically, in step S320 of the method for generating a voice montage according to an embodiment of the present invention, a voice montage may be generated using the speaker, emotion, and voice style set through steps S210 to S230 and the multi-speaker speech synthesizer trained with a Hidden Markov Model (HMM) or deep learning.

본 발명에서는, 파라미터 방식의 음성 합성 기법을 이용하여, 음편을 바로 사용하지 않고, 각 음편을 특징 파라미터로 변환하고 모델링을 통해 대푯값을 생성한 후, 음성을 합성하여 화자, 감정 및 음성 스타일이 설정된 음성 몽타주를 생성할 수 있다.
In the present invention, a parametric speech synthesis technique is used: instead of using speech segments directly, each speech segment is converted into feature parameters, representative values are generated through modeling, and speech is then synthesized to generate a voice montage in which the speaker, emotion, and voice style are set.

본 발명의 일실시예에 따른 음성 몽타주 생성 방법에서 이용되는 다화자 음성 합성기는 음성 몽타주를 생성하기 위한 음성 합성기이며, 여러 화자로 학습이 이루어지기 때문에, 각 화자의 음성을 생성할 수 있을 뿐만 아니라, 두 개 이상의 음성을 혼합하여 새로운 음성을 생성할 수 있다.
The multi-speaker speech synthesizer used in the method for generating a voice montage according to an embodiment of the present invention is a speech synthesizer for generating a voice montage. Because it is trained on multiple speakers, it can not only generate the voice of each speaker but also generate a new voice by mixing two or more voices.
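
One way such mixing could be realized is a weighted average of per-speaker representations. The sketch below assumes that each speaker is represented by a fixed-length vector (a "speaker embedding"), which the source does not specify; the function name `mix_speakers` and the example vectors are hypothetical.

```python
def mix_speakers(embeddings, weights):
    """Blend speaker vectors with normalized weights.

    embeddings: speaker name -> list of floats (all the same length)
    weights:    speaker name -> non-negative mixing weight
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    dim = len(next(iter(embeddings.values())))
    mixed = [0.0] * dim
    for name, weight in weights.items():
        for i, value in enumerate(embeddings[name]):
            mixed[i] += (weight / total) * value
    return mixed


embeddings = {"speaker_a": [1.0, 0.0], "speaker_b": [0.0, 1.0]}
blend = mix_speakers(embeddings, {"speaker_a": 3.0, "speaker_b": 1.0})
print(blend)
```

Normalizing the weights keeps the blended vector inside the range spanned by the training speakers, so equal weights give a plain average.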

단계 S400에서는, 단계 S300에서 생성된 음성 몽타주를 출력할 수 있다. 보다 구체적으로, 본 발명의 일실시예에 따른 음성 몽타주 생성 방법의 단계 S400에서는, 단계 S100 내지 단계 S300을 통해 생성된 음성 몽타주를 출력할 수 있다.
In step S400, the voice montage generated in step S300 may be output. More specifically, in step S400 of the method for generating a voice montage according to an embodiment of the present invention, the voice montage generated through steps S100 to S300 may be output.

도 9는 본 발명의 일실시예에 따른 음성 몽타주 생성 시스템(10)의 구성을 도시한 도면이다. 도 9에 도시된 바와 같이, 본 발명의 일실시예에 따른 음성 몽타주 생성 시스템(10)은, 음성 몽타주 생성 시스템(10)으로서, 문장을 입력하는 입력부(100), 입력부(100)에 의해 입력된 문장에 대해 특징 파라미터를 설정하는 파라미터 설정부(200), 파라미터 설정부(200)에 의해 설정된 특징 파라미터 및 다화자 음성 합성기를 이용하여 음성 몽타주를 생성하는 음성 몽타주 생성부(300), 및 음성 몽타주 생성부(300)에 의해 생성된 음성 몽타주를 출력하는 출력부(400)를 포함하여 구성될 수 있다.
FIG. 9 is a diagram showing the configuration of a voice montage generation system 10 according to an embodiment of the present invention. As shown in FIG. 9, the voice montage generation system 10 according to an embodiment of the present invention may include an input unit 100 for inputting a sentence, a parameter setting unit 200 for setting feature parameters for the sentence input by the input unit 100, a voice montage generation unit 300 for generating a voice montage using the feature parameters set by the parameter setting unit 200 and a multi-speaker speech synthesizer, and an output unit 400 for outputting the voice montage generated by the voice montage generation unit 300.

입력부(100)는, 문장을 입력할 수 있다. 보다 구체적으로는, 본 발명의 일실시예에 따른 음성 몽타주 생성 시스템(10)의 입력부(100)는, 음성 몽타주로 출력하고자 하는 목소리의 문장을 입력할 수 있다. 이때, 음성 몽타주 사용자의 기억과 유사하게 음성 몽타주를 생성하기 위해서 기억하는 상황의 문장을 음성 합성 샘플로 활용하여 입력할 수 있다.
The input unit 100 may input a sentence. More specifically, the input unit 100 of the voice montage generation system 10 according to an embodiment of the present invention may input the sentence to be spoken by the voice that is output as a voice montage. Here, in order to generate a voice montage that resembles the voice montage user's memory, a sentence from the remembered situation may be input and used as the speech synthesis sample.

파라미터 설정부(200)는, 입력부(100)에 의해 입력된 문장에 대해 특징 파라미터를 설정할 수 있다. 도 10은 본 발명의 일실시예에 따른 음성 몽타주 생성 시스템(10)에 있어서 파라미터 설정부(200)의 세부적인 구성을 도시한 도면이다. 도 10에 도시된 바와 같이, 본 발명의 일실시예에 따른 음성 몽타주 생성 시스템(10)의 파라미터 설정부(200)는, 입력부(100)에 의해 입력된 문장에 대해 화자를 설정하는 화자 설정 모듈(210), 화자 설정 모듈(210)에 의해 화자가 설정된 문장에 대해 감정을 설정하는 감정 설정 모듈(220), 및 감정 설정 모듈(220)에 의해 감정이 설정된 문장에 대해 음성 스타일을 설정하는 음성 스타일 설정 모듈(230)을 포함하여 구성될 수 있다.
The parameter setting unit 200 may set feature parameters for the sentence input by the input unit 100. FIG. 10 is a diagram showing the detailed configuration of the parameter setting unit 200 in the voice montage generation system 10 according to an embodiment of the present invention. As shown in FIG. 10, the parameter setting unit 200 of the voice montage generation system 10 according to an embodiment of the present invention may include a speaker setting module 210 for setting a speaker for the sentence input by the input unit 100, an emotion setting module 220 for setting an emotion for the sentence for which the speaker has been set by the speaker setting module 210, and a voice style setting module 230 for setting a voice style for the sentence for which the emotion has been set by the emotion setting module 220.

본 발명의 일실시예에 따른 음성 몽타주 생성 시스템(10)의 파라미터 설정부(200)에서의 특징 파라미터는, 화자, 감정 및 음성 스타일일 수 있으며, 보다 구체적으로는, 특징 파라미터는, 화자, 감정, 음성의 높낮이, 음성의 속도, 음성의 크기 및 발음일 수 있다. 다만, 상기의 화자, 감정, 음성의 높낮이, 음성의 속도, 음성의 크기 및 발음으로 특징 파라미터를 한정하는 것은 아니다.
The feature parameters in the parameter setting unit 200 of the voice montage generation system 10 according to an embodiment of the present invention may be the speaker, emotion, and voice style; more specifically, the feature parameters may be the speaker, emotion, voice pitch, voice speed, voice volume, and pronunciation. However, the feature parameters are not limited to the speaker, emotion, voice pitch, voice speed, voice volume, and pronunciation.
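
For illustration, the feature parameters listed above could be grouped into a simple data structure. This is a sketch only: the class names, field types, and default values are assumptions and are not defined by the present invention.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class VoiceStyle:
    """Hypothetical voice style settings; 1.0 means 'unchanged'."""
    pitch: float = 1.0
    speed: float = 1.0
    volume: float = 1.0
    pronunciation: str = "standard"


@dataclass
class FeatureParameters:
    """Speaker, emotion, and voice style for one input sentence."""
    speaker_weights: Dict[str, float]   # e.g. {"speaker_a": 0.7, "speaker_b": 0.3}
    emotion_weights: Dict[str, float]   # e.g. {"anger": 0.8, "sadness": 0.2}
    style: VoiceStyle = field(default_factory=VoiceStyle)


params = FeatureParameters(
    speaker_weights={"speaker_a": 0.7, "speaker_b": 0.3},
    emotion_weights={"anger": 1.0},
    style=VoiceStyle(pitch=0.9, speed=1.1),
)
print(params)
```

Keeping the speaker and emotion fields as weight maps leaves room for mixing several speakers or emotions, and further fields can be added since the parameters are not limited to those listed.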

화자 설정 모듈(210)은, 입력부(100)에서 입력된 문장에 대해 화자를 설정할 수 있다. 보다 구체적으로는, 본 발명의 일실시예에 따른 음성 몽타주 생성 시스템(10)의 화자 설정 모듈(210)은, 음색에 중점을 두며, 설정된 화자의 음성 특징들을 평균적으로 반영하고, 가중치를 활용하여 입력부(100)에서 입력된 문장에 대해 화자를 설정할 수 있다. 예를 들면, 성별, 나이대 등을 이용하여 입력부(100)에서 입력된 문장에 대해 화자를 설정할 수 있다.
The speaker setting module 210 may set a speaker for the sentence input from the input unit 100. More specifically, the speaker setting module 210 of the voice montage generation system 10 according to an embodiment of the present invention focuses on timbre, reflects the voice characteristics of the set speakers on average, and uses weights to set a speaker for the sentence input from the input unit 100. For example, the speaker may be set for the sentence input from the input unit 100 using gender, age group, and the like.

보다 구체적으로, 화자 설정 모듈(210)은, 설정된 화자의 음성 특징들을 평균적으로 반영하여 생성한 합성음을 사용자에게 들려주고, 생성할 합성음이 선택된 화자들 중 어느 화자에 얼마나 더 가까워야 하는지에 대한 질의에 대한 답변을 사용자로부터 입력받으며, 입력받은 답변에 따라 화자 선택의 가중치를 결정할 수 있다. 이렇게 결정된 가중치를 반영하여 다시 생성한 합성음을 사용자에게 다시 들려주고, 가중치가 올바로 선택되었다고 판단될 때까지 반복적으로 시도함으로써, 사용자가 원하는 음성에 가까운 음성을 생성할 수 있다.
More specifically, the speaker setting module 210 plays to the user a synthesized sound generated by reflecting the voice characteristics of the set speakers on average, receives from the user an answer to a query about which of the selected speakers the synthesized sound should be closer to and by how much, and determines the speaker selection weights according to the answer. The synthesized sound regenerated with the determined weights is played to the user again, and this is repeated until the weights are judged to be correctly selected, so that a voice close to the one the user wants can be generated.
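
The iterative weight selection described above can be sketched as a feedback loop. This is a minimal illustration under assumptions: the callback `ask_user` stands in for playing the synthesized sound and collecting the user's answer, and the fixed step size and round limit are hypothetical choices, not part of the present invention.

```python
def refine_weights(speakers, ask_user, step=0.1, max_rounds=20):
    """Adjust speaker weights from user feedback until the user is satisfied.

    ask_user(weights) returns the name of the speaker the result should move
    toward, or None when the current weights are judged correct.
    """
    # Start from an equal (average) contribution of every selected speaker.
    weights = {name: 1.0 / len(speakers) for name in speakers}
    for _ in range(max_rounds):
        answer = ask_user(weights)
        if answer is None:
            break
        weights[answer] += step
        total = sum(weights.values())
        weights = {name: w / total for name, w in weights.items()}  # renormalize
    return weights


def simulated_user(weights):
    """Stand-in for a listener who wants speaker_a to dominate."""
    return None if weights["speaker_a"] >= 0.7 else "speaker_a"


final = refine_weights(["speaker_a", "speaker_b"], simulated_user)
print(final)
```

In an actual system, each call to `ask_user` would regenerate and replay the synthesized sound with the current weights before asking the question again.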

감정 설정 모듈(220)은, 화자 설정 모듈(210)에 의해 화자가 설정된 문장에 대해 감정을 설정할 수 있다. 보다 구체적으로는, 본 발명의 일실시예에 따른 음성 몽타주 생성 시스템(10)의 감정 설정 모듈(220)은, 음성 몽타주 사용자가 기억하는 상황의 감정을 설정하여, 최종적으로 출력되는 음성 몽타주가 목표하는 용의자의 음성과 비슷하도록 유도할 수 있다. 예를 들면, 분노, 슬픔, 기쁨 등의 감정을 설정할 수 있으며, 또한, 여러 감정을 혼합하여 화자 설정 모듈(210)에 의해 화자가 설정된 문장에 대해 감정을 설정할 수 있다.
The emotion setting module 220 may set an emotion for the sentence for which the speaker has been set by the speaker setting module 210. More specifically, the emotion setting module 220 of the voice montage generation system 10 according to an embodiment of the present invention sets the emotion of the situation that the voice montage user remembers, thereby guiding the finally output voice montage to resemble the voice of the target suspect. For example, emotions such as anger, sadness, and joy may be set, and several emotions may also be mixed to set the emotion for the sentence for which the speaker has been set by the speaker setting module 210.

음성 스타일 설정 모듈(230)은, 감정 설정 모듈(220)에 의해 감정이 설정된 문장에 대해 음성 스타일을 설정할 수 있다. 보다 구체적으로는, 본 발명의 일실시예에 따른 음성 몽타주 생성 시스템(10)의 음성 스타일 설정 모듈(230)은, 화자 설정 모듈(210) 및 감정 설정 모듈(220)에 의해 화자 및 감정이 설정된 문장에 대해, 음성의 높낮이, 음성의 속도, 음성의 크기 및 발음을 설정할 수 있다.
The voice style setting module 230 may set a voice style for the sentence for which the emotion has been set by the emotion setting module 220. More specifically, the voice style setting module 230 of the voice montage generation system 10 according to an embodiment of the present invention may set the voice pitch, voice speed, voice volume, and pronunciation for the sentence for which the speaker and emotion have been set by the speaker setting module 210 and the emotion setting module 220.

음성 몽타주 생성부(300)는, 파라미터 설정부(200)에서 설정된 특징 파라미터 및 다화자 음성 합성기를 이용하여 음성 몽타주를 생성할 수 있다. 도 11은 본 발명의 일실시예에 따른 음성 몽타주 생성 시스템(10)에 있어서, 음성 몽타주 생성부(300)의 세부적인 구성을 도시한 도면이다. 도 11에 도시된 바와 같이, 본 발명의 일실시예에 따른 음성 몽타주 생성 시스템(10)의 음성 몽타주 생성부(300)는, 은닉 마르코프 모델(Hidden Markov Model, HMM) 또는 딥 러닝(deep learning)을 이용하여 다화자 음성 합성기를 학습시키는 학습 모듈(310), 및 화자 설정 모듈(210), 감정 설정 모듈(220) 및 음성 스타일 설정 모듈(230)을 통해 설정된 특징 파라미터와 학습 모듈(310)에 의해 학습된 다화자 음성 합성기를 이용하여 음성 몽타주를 생성하는 음성 몽타주 생성 모듈(320)을 포함하여 구성될 수 있다.
The voice montage generation unit 300 may generate a voice montage using the feature parameters set in the parameter setting unit 200 and a multi-speaker speech synthesizer. FIG. 11 is a diagram illustrating the detailed configuration of the voice montage generation unit 300 in the voice montage generation system 10 according to an embodiment of the present invention. As illustrated in FIG. 11, the voice montage generation unit 300 of the voice montage generation system 10 according to an embodiment of the present invention may include a learning module 310 that trains the multi-speaker speech synthesizer using a Hidden Markov Model (HMM) or deep learning, and a voice montage generation module 320 that generates a voice montage using the feature parameters set through the speaker setting module 210, the emotion setting module 220, and the voice style setting module 230 and the multi-speaker speech synthesizer trained by the learning module 310.

학습 모듈(310)은, 은닉 마르코프 모델(Hidden Markov Model, HMM) 또는 딥 러닝(deep learning)을 이용하여 다화자 음성 합성기를 학습시킬 수 있다. 보다 구체적으로는, 본 발명의 일실시예에 따른 음성 몽타주 생성 시스템(10)의 학습 모듈(310)은, 전술한 바와 같은, 은닉 마르코프 모델(Hidden Markov Model, HMM) 또는 딥 러닝(deep learning)을 이용하여 다화자 음성 합성기를 학습시킬 수 있어, 빠르게 다화자 음성 합성기를 학습시키고, 출력되는 음성 몽타주의 정확도를 높일 수 있다.
The learning module 310 may train a multi-speaker speech synthesizer using a Hidden Markov Model (HMM) or deep learning. More specifically, the learning module 310 of the voice montage generation system 10 according to an embodiment of the present invention can train the multi-speaker speech synthesizer using a Hidden Markov Model (HMM) or deep learning as described above, so that the multi-speaker speech synthesizer can be trained quickly and the accuracy of the output voice montage can be increased.

음성 몽타주 생성 모듈(320)은, 화자 설정 모듈(210), 감정 설정 모듈(220) 및 음성 스타일 설정 모듈(230)을 통해 설정된 특징 파라미터와 학습 모듈(310)에 의해 학습된 다화자 음성 합성기를 이용하여 음성 몽타주를 생성할 수 있다. 보다 구체적으로는, 본 발명의 일실시예에 따른 음성 몽타주 생성 시스템(10)의 음성 몽타주 생성 모듈(320)은, 화자 설정 모듈(210), 감정 설정 모듈(220) 및 음성 스타일 설정 모듈(230)을 통해 설정된 화자, 감정 및 음성 스타일과 은닉 마르코프 모델(Hidden Markov Model, HMM) 또는 딥 러닝(deep learning)으로 학습된 다화자 음성 합성기를 이용하여 음성 몽타주를 생성할 수 있다.
The voice montage generation module 320 may generate a voice montage using the feature parameters set through the speaker setting module 210, the emotion setting module 220, and the voice style setting module 230 and the multi-speaker speech synthesizer trained by the learning module 310. More specifically, the voice montage generation module 320 of the voice montage generation system 10 according to an embodiment of the present invention may generate a voice montage using the speaker, emotion, and voice style set through the speaker setting module 210, the emotion setting module 220, and the voice style setting module 230 and the multi-speaker speech synthesizer trained with a Hidden Markov Model (HMM) or deep learning.

출력부(400)는, 음성 몽타주 생성부(300)에서 생성된 음성 몽타주를 출력할 수 있다. 보다 구체적으로, 본 발명의 일실시예에 따른 음성 몽타주 생성 시스템(10)의 출력부(400)는, 입력부(100) 내지 음성 몽타주 생성부(300)을 통해 생성된 음성 몽타주를 출력할 수 있다.
The output unit 400 may output the voice montage generated by the voice montage generation unit 300. More specifically, the output unit 400 of the voice montage generation system 10 according to an embodiment of the present invention may output the voice montage generated through the input unit 100 to the voice montage generation unit 300.
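
The cooperation of the four units described above can be sketched as a small pipeline. This is an illustrative skeleton, not the implementation of the present invention: the class `VoiceMontageSystem`, its method names, and the `Synthesizer` callable interface are assumptions, and `fake_synthesizer` is a placeholder that returns encoded text rather than audio.

```python
from typing import Any, Callable, Dict

# Hypothetical interface: (sentence, feature parameters) -> audio bytes.
Synthesizer = Callable[[str, Dict[str, Any]], bytes]


class VoiceMontageSystem:
    """Illustrative stand-in for system 10."""

    def __init__(self, synthesizer: Synthesizer) -> None:
        # Assumed to be the already trained multi-speaker synthesizer.
        self.synthesizer = synthesizer

    def input_sentence(self, sentence: str) -> str:
        """Input unit 100: accept the sentence to be spoken."""
        sentence = sentence.strip()
        if not sentence:
            raise ValueError("sentence must not be empty")
        return sentence

    def set_parameters(self, speaker, emotion, style) -> Dict[str, Any]:
        """Parameter setting unit 200: speaker (210), emotion (220), style (230)."""
        return {"speaker": speaker, "emotion": emotion, "style": style}

    def generate(self, sentence: str, params: Dict[str, Any]) -> bytes:
        """Voice montage generation unit 300: run the synthesizer."""
        return self.synthesizer(sentence, params)

    def output(self, audio: bytes) -> bytes:
        """Output unit 400: here the result is simply returned."""
        return audio

    def run(self, sentence, speaker, emotion, style) -> bytes:
        text = self.input_sentence(sentence)
        params = self.set_parameters(speaker, emotion, style)
        return self.output(self.generate(text, params))


def fake_synthesizer(sentence: str, params: Dict[str, Any]) -> bytes:
    """Placeholder that encodes its inputs instead of producing audio."""
    return f"{sentence}|{sorted(params)}".encode("utf-8")


system = VoiceMontageSystem(fake_synthesizer)
montage = system.run(
    "Hand over the bag.",
    {"speaker_a": 0.7, "speaker_b": 0.3},
    {"anger": 1.0},
    {"pitch": 0.9, "speed": 1.1},
)
print(montage)
```

Passing the synthesizer in as a callable keeps the pipeline independent of whether the underlying model is HMM-based or based on deep learning.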

전술한 바와 같이, 본 발명에서 제안하고 있는 음성 몽타주 생성 방법 및 시스템(10)에 따르면, 다화자 음성 합성기를 기반으로 각 화자의 각기 다른 특징 파라미터를 설정함으로써, 찾고자하는 용의자의 목소리와 유사한 음성을 합성하여 출력할 수 있다. 또한, 본 발명에 따르면, 은닉 마르코프 모델(Hidden Markov Model, HMM) 또는 딥 러닝(Deep Learning)을 이용하여 다화자 음성 합성기를 학습시킴으로써, 빠르게 다화자 음성 합성기를 학습시키고, 출력되는 음성 몽타주의 정확도를 높일 수 있다. 뿐만 아니라, 본 발명에 따르면, 은닉 마르코프 모델(Hidden Markov Model, HMM) 또는 딥 러닝(Deep Learning)으로 학습된 다화자 음성 합성기를 사용함으로써, 각 화자의 음성을 만들 수 있을 뿐만 아니라, 두 개 이상의 음색을 혼합하여 목적으로 하는 화자의 음색을 효과적으로 합성하여 출력할 수 있다.
As described above, according to the voice montage generation method and system 10 proposed in the present invention, by setting different feature parameters for each speaker on the basis of a multi-speaker speech synthesizer, a voice similar to the voice of the suspect being sought can be synthesized and output. In addition, according to the present invention, by training the multi-speaker speech synthesizer using a Hidden Markov Model (HMM) or deep learning, the multi-speaker speech synthesizer can be trained quickly and the accuracy of the output voice montage can be increased. Furthermore, according to the present invention, by using a multi-speaker speech synthesizer trained with a Hidden Markov Model (HMM) or deep learning, not only can the voice of each speaker be produced, but two or more timbres can also be mixed to effectively synthesize and output the timbre of the intended speaker.

이상 설명한 본 발명은 본 발명이 속한 기술분야에서 통상의 지식을 가진 자에 의하여 다양한 변형이나 응용이 가능하며, 본 발명에 따른 기술적 사상의 범위는 아래의 특허청구범위에 의하여 정해져야 할 것이다.
The present invention described above can be variously modified or applied by a person having ordinary knowledge in the technical field to which the present invention belongs, and the scope of the technical idea according to the present invention should be defined by the following claims.

10: 음성 몽타주 생성 시스템
100: 입력부
200: 파라미터 설정부
210: 화자 설정 모듈
220: 감정 설정 모듈
230: 음성 스타일 설정 모듈
300: 음성 몽타주 생성부
310: 학습 모듈
320: 음성 몽타주 생성 모듈
400: 출력부
S100: 문장을 입력하는 단계
S200: 단계 S100에서 입력된 문장에 대해 특징 파라미터를 설정하는 단계
S210: 단계 S100에서 입력된 문장에 대해 화자를 설정하는 단계
S220: 단계 S210에서 화자가 설정된 문장에 대해 감정을 설정하는 단계
S230: 단계 S220에서 감정이 설정된 문장에 대해 음성 스타일을 설정하는 단계
S300: 단계 S200에서 설정된 특징 파라미터 및 다화자 음성 합성기를 이용하여 음성 몽타주를 생성하는 단계
S310: 은닉 마르코프 모델(Hidden Markov Model, HMM) 또는 딥 러닝(deep learning)을 이용하여 다화자 음성 합성기를 학습시키는 단계
S320: 단계 S210 내지 단계 S230을 통해 설정된 특징 파라미터와 단계 S310에서 학습된 다화자 음성 합성기를 이용하여 음성 몽타주를 생성하는 단계
S400: 단계 S300에서 생성된 음성 몽타주를 출력하는 단계
10: voice montage generation system
100: input unit
200: parameter setting unit
210: speaker setting module
220: emotion setting module
230: voice style setting module
300: voice montage generation unit
310: learning module
320: voice montage generation module
400: output unit
S100: step of inputting a sentence
S200: step of setting feature parameters for the sentence input in step S100
S210: step of setting a speaker for the sentence input in step S100
S220: step of setting an emotion for the sentence for which the speaker was set in step S210
S230: step of setting a voice style for the sentence for which the emotion was set in step S220
S300: step of generating a voice montage using the feature parameters set in step S200 and the multi-speaker speech synthesizer
S310: step of training the multi-speaker speech synthesizer using a Hidden Markov Model (HMM) or deep learning
S320: step of generating a voice montage using the feature parameters set in steps S210 to S230 and the multi-speaker speech synthesizer trained in step S310
S400: step of outputting the voice montage generated in step S300

Claims

A method for generating a voice montage, the method comprising:
(1) inputting a sentence;
(2) setting feature parameters for the sentence input in step (1);
(3) generating a voice montage using the feature parameters set in step (2) and a multi-speaker speech synthesizer; and
(4) outputting the voice montage generated in step (3).

The method of claim 1, wherein the feature parameters in step (2) are
a speaker, an emotion, and a voice style.

The method of claim 2, wherein the voice style comprises
the pitch of the voice, the speed of the voice, the volume of the voice, and the pronunciation.

The method of claim 2, wherein step (2) comprises:
(2-1) setting a speaker for the sentence input in step (1);
(2-2) setting an emotion for the sentence for which the speaker was set in step (2-1); and
(2-3) setting a voice style for the sentence for which the emotion was set in step (2-2).

The method of claim 4, wherein step (3) comprises:
(3-1) training the multi-speaker speech synthesizer using a Hidden Markov Model (HMM) or deep learning; and
(3-2) generating a voice montage using the feature parameters set through steps (2-1) to (2-3) and the multi-speaker speech synthesizer trained in step (3-1).

A voice montage generation system, comprising:
an input unit 100 for inputting a sentence;
a parameter setting unit 200 for setting feature parameters for the sentence input by the input unit 100;
a voice montage generation unit 300 for generating a voice montage using the feature parameters set by the parameter setting unit 200 and a multi-speaker speech synthesizer; and
an output unit 400 for outputting the voice montage generated by the voice montage generation unit 300.

The system of claim 6, wherein the feature parameters are
a speaker, an emotion, and a voice style.

The system of claim 7, wherein the voice style comprises
the pitch of the voice, the speed of the voice, the volume of the voice, and the pronunciation.

The system of claim 7, wherein the parameter setting unit 200 comprises:
a speaker setting module 210 for setting a speaker for the sentence input by the input unit 100;
an emotion setting module 220 for setting an emotion for the sentence for which the speaker has been set by the speaker setting module 210; and
a voice style setting module 230 for setting a voice style for the sentence for which the emotion has been set by the emotion setting module 220.

The system of claim 9, wherein the voice montage generation unit 300 comprises:
a learning module that trains the multi-speaker speech synthesizer using a Hidden Markov Model (HMM) or deep learning; and
a voice montage generation module that generates a voice montage using the feature parameters set through the speaker setting module, the emotion setting module, and the voice style setting module and the multi-speaker speech synthesizer trained by the learning module.