KR20210125677A

KR20210125677A - Device, method and computer program of generating voice for training data

Info

Publication number: KR20210125677A
Application number: KR1020200043108A
Authority: KR
Inventors: 이정준
Original assignee: 주식회사 케이티
Priority date: 2020-04-09
Filing date: 2020-04-09
Publication date: 2021-10-19

Abstract

A device for generating voices for training data comprises: an image conversion unit that converts a voice into a first spectrogram; a simulation unit that uses a pre-trained deep learning model to generate a plurality of simulated border images similar to the border of the first spectrogram and a plurality of simulated color images similar to the color of the first spectrogram; and a voice generation unit that combines the plurality of simulated border images and the plurality of simulated color images to generate a plurality of second spectrograms, and converts the plurality of second spectrograms into voices to generate a plurality of voices for training data. The invention can thereby reduce the cost of collecting voices for training data in machine learning.

Description

Apparatus, method, and program for generating voices for training data

The present invention relates to an apparatus, method, and program for generating voices for use as machine learning training data.

To use voice data for machine learning, a certain amount of voice data for training must be secured. That is, improving the performance of voice-based machine learning requires voices of many kinds and in vast quantities.

In particular, one-shot voices such as a human scream vary greatly from person to person in characteristics such as pitch, intensity, and pattern, so a machine learning model needs to be trained on an even larger amount of voice data.

However, collecting voices such as human screams incurs astronomical costs running into the hundreds of millions.

Korean Patent Laid-Open Publication No. 10-2019-0106902 (published September 18, 2019)

The present invention is intended to solve the above problems of the prior art and to provide an apparatus, method, and program for generating voices for training data that can reduce the cost of collecting voices for machine learning training data.

However, the technical problems to be solved by the present embodiment are not limited to those described above, and other technical problems may exist.

As a means for achieving the above technical object, an embodiment of the present invention may provide an apparatus for generating voices for training data, comprising: an image conversion unit that converts a voice into a first spectrogram; a simulation unit that uses a pre-trained deep learning model to generate a plurality of simulated border images similar to the border of the first spectrogram and a plurality of simulated color images similar to the color of the first spectrogram; and a voice generation unit that combines the plurality of simulated border images and the plurality of simulated color images to generate a plurality of second spectrograms, and converts the plurality of second spectrograms into voices to generate a plurality of voices for training data.

Another embodiment of the present invention may provide a method for generating voices for training data, comprising: an image conversion step of converting a voice into a first spectrogram; a simulation step of generating, using a pre-trained deep learning model, a plurality of simulated border images similar to the border of the first spectrogram; a simulation step of generating, using a pre-trained deep learning model, a plurality of simulated color images similar to the color of the first spectrogram; a step of combining the plurality of simulated border images and the plurality of simulated color images to generate a plurality of second spectrograms; and a step of converting the plurality of second spectrograms into voices to generate a plurality of voices for training data.

Yet another embodiment of the present invention may provide a computer program stored on a medium, the computer program comprising a sequence of instructions that, when executed by a computing device, cause the computing device to: convert a voice into a first spectrogram; generate, using a pre-trained deep learning model, a plurality of simulated border images similar to the border of the first spectrogram and a plurality of simulated color images similar to the color of the first spectrogram; combine the plurality of simulated border images and the plurality of simulated color images to generate a plurality of second spectrograms; and convert the plurality of second spectrograms into voices to generate a plurality of voices for training data.

The above means for solving the problem are merely exemplary and should not be construed as limiting the present invention. In addition to the exemplary embodiments described above, there may be additional embodiments described in the drawings and the detailed description.

According to any one of the above means, a vast amount of voice data can be artificially synthesized from a minimal set of voices for training data, which reduces the cost of collecting voices for machine learning training data.

FIG. 1 is a block diagram of an apparatus for generating voices for training data according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram for explaining a voice according to an embodiment of the present invention.
FIG. 3 is a diagram exemplarily illustrating the separated border image and color image according to an embodiment of the present invention.
FIG. 4 is an exemplary diagram for explaining the border simulation unit according to an embodiment of the present invention.
FIG. 5 is an exemplary diagram for explaining the color simulation unit according to an embodiment of the present invention.
FIG. 6 is an exemplary diagram for explaining the voice generation unit according to an embodiment of the present invention.
FIG. 7 is a flowchart of a method for generating voices for training data according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice them. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted for clarity, and similar reference numerals denote similar parts throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element interposed between them. Likewise, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them, and it does not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In this specification, a "unit" includes a unit realized by hardware, a unit realized by software, and a unit realized using both. One unit may be realized using two or more pieces of hardware, and two or more units may be realized by one piece of hardware.

Some of the operations or functions described in this specification as being performed by a terminal or device may instead be performed by a server connected to that terminal or device. Likewise, some of the operations or functions described as being performed by a server may be performed by a terminal or device connected to that server.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of an apparatus for generating voices for training data according to an embodiment of the present invention. FIG. 1 exemplarily shows the minimal set of components that can be controlled by the apparatus 100 for generating voices for training data.

Referring to FIG. 1, the apparatus 100 for generating voices for training data may include a database 110, an image conversion unit 120, a simulation unit 130, a voice generation unit 140, and a learning unit 150. The simulation unit 130 may include a border simulation unit 132 and a color simulation unit 134.

Based on collected voices, the apparatus 100 for generating voices for training data may generate a plurality of artificial voices for training data that have similar characteristics.

The database 110 may store and manage voices collected through various channels. The database 110 may also store and manage the generated artificial voices for training data. For example, in the present application, a voice may be a one-shot voice such as a human scream or the sound of a door closing, but is not necessarily limited thereto.

In addition, the database 110 may collect only the minimum number of voices needed to generate artificial voices for training data.

The database 110 may separate out only the one-shot voices that are target voices from the collected voices, and manage the data classified into target voices and non-target voices.

To train the neural network model on the collected target voices, the database 110 may use voices other than the target voices, that is, non-target voices, as a kind of 'False Data'.

To make up for the limited amount of data, the database 110 may manage the separated target voices by dividing them into a training set and a simulation set. By frequently swapping which target voices are used for training and which for simulation, the database 110 can obtain an effect similar to bagging (bootstrap aggregating) on the target voice data.

For example, suppose that human screams A, B, C, and D are stored in the database 110 and are used to train a deep learning model. The database 110 may use A and B for training and C and D for simulation. In this case, the apparatus 100 for generating voices for training data trains the deep learning model on A and B, and if the simulated output it generates is similar to C or D, the simulation result may be evaluated as good. The database 110 may also use A and C for training and B and D for simulation, train the deep learning model on that split, and compare and judge the generated simulation output in the same way. In this way, by rearranging the same stored data and training the deep learning model on it repeatedly, the database 110 can achieve the effect of training on many different kinds of data using only a small amount of data.
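The rotation of training and simulation sets described above can be sketched as follows. This is only a minimal illustration: the function name `rotate_splits` and the choice to enumerate every possible split are assumptions made for the example, and the specification does not state how the database 110 selects the splits.

```python
from itertools import combinations


def rotate_splits(samples, n_train):
    """Yield every (training, simulation) partition with n_train training samples."""
    for train in combinations(samples, n_train):
        # Everything not chosen for training is held back for simulation/evaluation.
        simulation = tuple(s for s in samples if s not in train)
        yield train, simulation


# Hypothetical labels standing in for the four screams A, B, C, D.
splits = list(rotate_splits(["A", "B", "C", "D"], 2))
for train, simulation in splits:
    print("train:", train, "simulation:", simulation)
```

With four recordings and two used for training, this enumerates six distinct splits, including the (A, B) / (C, D) and (A, C) / (B, D) cases mentioned in the text.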

The image conversion unit 120 may convert a voice into a first spectrogram. For example, the image conversion unit 120 may convert a one-shot voice, such as a human scream, the squeak of a wooden door, or the thud of an iron door closing, into a first spectrogram.

A spectrogram combines the characteristics of a waveform and a spectrum. A spectrogram image is three-dimensional data composed of time, frequency, and amplitude: the X-axis represents time, the Y-axis represents frequency, and the amplitude at each point along the time and frequency axes is displayed as a difference in print density or display color.
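As an illustration of the conversion from a voice to a spectrogram, the following minimal sketch computes a magnitude spectrogram with a short-time Fourier transform using only the Python standard library. The specification does not name the transform or its parameters; the function name `stft_magnitude`, the Hann window, the frame length of 64, and the hop of 32 are all assumptions chosen for the example.

```python
import cmath
import math


def stft_magnitude(signal, frame_len=64, hop=32):
    """Return a list of frames, each a list of magnitudes for bins 0..frame_len//2."""
    # A Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):
            # Naive discrete Fourier transform of one windowed frame.
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            mags.append(abs(acc))
        frames.append(mags)
    return frames


# Hypothetical test tone: 1000 Hz sampled at 8000 Hz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]
spec = stft_magnitude(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), "frames x", len(spec[0]), "bins; peak bin", peak_bin)
```

Each row of `spec` is one time step and each column one frequency bin, so the value stored at a (time, frequency) position is the amplitude that a spectrogram image would render as density or color.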

The image conversion unit 120 may split the first spectrogram into a border image representing the border of the first spectrogram and a color image representing the color of the first spectrogram.

Referring to FIG. 3, the image conversion unit 120 may split the first spectrogram (for example, 200 in FIG. 2) into a border image 310 representing the border of the first spectrogram and a color image 320 representing its color.

By splitting the first spectrogram into a border and a color, the image conversion unit 120 allows the simulation unit 130 to input the split first spectrogram into a deep learning model and artificially generate simulated images.
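The specification does not describe how the border image and the color image are extracted. The sketch below shows one possible interpretation, purely as an assumption: the color image is the spectrogram normalized to the range 0 to 1, and the border image is a binary mask marking where the normalized amplitude reaches a threshold. The function name `split_spectrogram` and the threshold value are hypothetical.

```python
def split_spectrogram(spec, threshold=0.1):
    """Split a magnitude spectrogram into a binary border mask and a color image.

    Assumption for illustration only: "border" is the region where the
    normalized amplitude is at least `threshold`; "color" is the normalized
    amplitude itself.
    """
    peak = max(max(row) for row in spec) or 1.0  # avoid dividing by zero on silence
    color = [[v / peak for v in row] for row in spec]
    border = [[1 if v >= threshold else 0 for v in row] for row in color]
    return border, color


# Tiny hypothetical spectrogram: 2 time frames x 3 frequency bins.
spec = [[0.0, 0.5, 2.0], [0.1, 4.0, 0.2]]
border, color = split_spectrogram(spec, threshold=0.1)
print(border)
print(color)
```

Under this interpretation the two images carry complementary information: the mask captures the shape of the sound event, and the color image captures its energy distribution.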

The simulation unit 130 may use a pre-trained deep learning model to generate a plurality of simulated border images similar to the border of the first spectrogram and a plurality of simulated color images similar to the color of the first spectrogram. The simulation unit 130 may include a border simulation unit 132 and a color simulation unit 134.

The voice generation unit 140 may combine the plurality of simulated border images and the plurality of simulated color images to generate a plurality of second spectrograms, and convert the plurality of second spectrograms into voices to generate a plurality of voices for training data.

The learning unit 150 may train a first deep learning model to generate a plurality of simulated border images by inputting a plurality of border images 310 into the first deep learning model, and may train a second deep learning model to generate a plurality of simulated color images by inputting a plurality of color images into the second deep learning model.

Referring briefly to FIG. 2, a voice can be represented as an image such as the one shown in FIG. 2. In particular, a one-shot voice such as a human scream is characterized by sound arising abruptly at a specific time. Such a one-shot voice may therefore include a sharp cut edge perpendicular to the time axis (x-axis) (see reference numeral 200). Also, because the voice is one-shot, the right side of the image (see reference numeral 200) ends within a finite range.

Even when one-shot voices belong to the same category, such as human screams, their waveform images may differ in shape and color when inspected visually. It is very difficult for a person to recognize or judge information about such a voice by looking directly at the waveform image. It is therefore difficult to artificially generate a voice such as a scream.

The simulation unit 130 may simulate the first spectrogram image using a pre-trained deep learning model. Here, the deep learning model may be a generative adversarial network (GAN). For example, the deep learning model may be a deep convolutional generative adversarial network (DCGAN).

Based on the generative adversarial network, the simulation unit 130 may generate a plurality of simulated images similar to the border and color of the first spectrogram. A generative adversarial network is a neural network in which competitive learning is applied to a deep neural network.

A generative adversarial network is a model that learns to generate similar samples using patch-level samples randomly extracted from real images, and learns to determine whether a generated similar sample is real or fake.

Such a generative adversarial network includes a generator that generates similar samples from real images and a discriminator that determines the authenticity of the similar samples; the generator and the discriminator influence each other as they are trained.
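The way the generator and the discriminator influence each other is commonly written as a minimax objective. The specification does not state the loss function that is used, so the formula below is the standard formulation from the GAN literature and is given only as an assumption about what such training typically optimizes. Here $G$ is the generator, $D$ is the discriminator, $x$ is a real sample, and $z$ is the random input to the generator.

```latex
\min_{G}\max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator is trained to increase $V$ by scoring real samples high and generated samples low, while the generator is trained to decrease $V$ by producing samples that the discriminator scores as real.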

Looking at the structure of the generator, the generator may be formed by stacking a fully connected convolutional layer with 64 channels, a batch normalization layer, and a ReLU (Rectified Linear Unit) layer eight times. Here, a depthwise convolution layer is used in the generator to reduce the total number of parameters in the generative adversarial network. Because the depthwise convolution layer combines the result values of the individual channels into one, it can reduce the number of parameters while preserving the characteristics of each layer.
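The parameter saving can be illustrated with simple arithmetic. The sketch below assumes that the depthwise convolution is used in its common depthwise-separable form (a per-channel spatial filter followed by a 1x1 pointwise convolution that merges the channels) and assumes a 3x3 kernel; neither detail is stated in the specification. Bias terms are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution followed by a 1x1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that merges the channels
    return depthwise + pointwise


# 64 channels and 8 stacked blocks come from the text; the 3x3 kernel is an assumption.
CHANNELS, KERNEL, BLOCKS = 64, 3, 8
standard_total = BLOCKS * standard_conv_params(CHANNELS, CHANNELS, KERNEL)
separable_total = BLOCKS * depthwise_separable_params(CHANNELS, CHANNELS, KERNEL)
print(standard_total, separable_total, round(standard_total / separable_total, 2))
```

Under these assumptions the eight convolution layers need roughly one eighth as many weights when the depthwise-separable form is used.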

Looking at the structure of the discriminator, the discriminator may be formed by stacking a fully connected convolutional layer, a batch normalization layer, and a Leaky ReLU layer five times.

Meanwhile, a deep convolutional generative adversarial network is a deep learning model that is particularly good at learning the boundaries of objects. It can also generate a large number of similar images by adjusting dummy parameter values.

A deep convolutional generative adversarial network can thus be used to learn the boundary and color patterns of images that satisfy specific conditions. Through training, it can analyze the boundary and color patterns of images and be trained to artificially generate similar images.

Once training on both the boundary pattern and the color pattern is complete, the deep convolutional generative adversarial network can generate a continuous series of artificial images by adjusting its numerical parameters.

Such a deep convolutional generative adversarial network can learn the spectrogram of a voice and synthesize artificial images similar to that spectrogram. When a synthesized artificial image is converted back into a voice, the result is a voice that closely resembles a real voice while exhibiting varied characteristics.

For example, by adjusting its input parameters, a deep convolutional generative adversarial network can generate, from the spectrograms of a squeaking wooden door and a thudding iron door, spectrograms for a variety of sounds, ranging from a door of an intermediate material closing dully to one closing lightly.
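One common way to obtain such a continuous range of outputs from a trained generator is to interpolate between two input vectors and feed each intermediate vector to the generator. The sketch below shows only the interpolation step, and it assumes that the "input parameters" in the text correspond to the generator's input vector; the function name `interpolate` and the example vectors are hypothetical.

```python
def interpolate(z_a, z_b, steps):
    """Return `steps` vectors evenly spaced from z_a to z_b, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [
        [a + (b - a) * i / (steps - 1) for a, b in zip(z_a, z_b)]
        for i in range(steps)
    ]


# Hypothetical 2-dimensional inputs, e.g. one that yields a wooden-door
# spectrogram and one that yields an iron-door spectrogram.
z_wood = [0.0, 1.0]
z_iron = [1.0, 3.0]
path = interpolate(z_wood, z_iron, 5)
for z in path:
    print(z)  # each z would be passed to the trained generator
```

Each intermediate vector would yield a spectrogram lying between the two endpoint sounds, which matches the intermediate door sounds described in the example above.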

The simulation unit 130 may include a border simulation unit 132 that inputs a border image into the first deep learning model to generate a plurality of simulated border images.

The border simulation unit 132 may input the border image split off by the image conversion unit 120 into the first deep learning model to artificially generate a plurality of simulated border images.

The simulation unit 130 may include a color simulation unit 134 that inputs a color image into the second deep learning model to generate a plurality of simulated color images.

The color simulation unit 134 may input the color image split off by the image conversion unit 120 into the second deep learning model to artificially generate a plurality of simulated color images.

FIG. 4 is an exemplary diagram for explaining the border simulation unit according to an embodiment of the present invention.

First, based on the target voices in the database 110, the learning unit 150 may analyze what form of boundary pattern appears in the first spectrogram converted from each target voice, and train the first deep learning model 410 so that it can simulate that pattern.

For example, the learning unit 150 may input a border image 420 into the first deep learning model 410 and train the first deep learning model 410 to generate a plurality of simulated border images similar to the border image 420.

The border simulation unit 132 may then artificially generate and output a plurality of simulated border images 430 similar to the border image 420 based on the trained first deep learning model 410.

For example, the border simulation unit 132 may input a border image 420 characterized by three peaks into the trained first deep learning model 410, and generate and output a plurality of simulated border images 430 that are likewise characterized by three peaks.

FIG. 5 is an exemplary diagram for explaining the color simulation unit according to an embodiment of the present invention.

First, based on the target voices in the database 110, the learning unit 150 may analyze what form of energy pattern appears in the first spectrogram converted from each target voice, and train the second deep learning model 510 so that it can simulate that pattern.

For example, the learning unit 150 may input a color image 520 into the second deep learning model 510 and train the second deep learning model 510 to generate a plurality of simulated color images similar to the color image 520.

The color simulation unit 134 may then artificially generate and output a plurality of simulated color images 530 similar to the color image 520 based on the trained second deep learning model 510.

FIG. 6 is an exemplary diagram for explaining the voice generation unit according to an embodiment of the present invention.

Referring to FIG. 6, the voice generation unit 140 may combine a plurality of simulated border images 610 and a plurality of simulated color images 620, both generated based on the first spectrogram, to generate a plurality of second spectrograms 630, and may generate a plurality of voices 640 for training data based on the plurality of second spectrograms 630.

For example, the voice generation unit 140 may integrate the outputs 610 and 620 produced by the border simulation unit 132 and the color simulation unit 134. Specifically, the voice generation unit 140 may integrate, in parallel, the plurality of simulated border images 610 and the plurality of simulated color images 620 generated based on the first spectrogram to generate the second spectrograms 630.
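The combination step can be sketched as pairing every simulated border image with every simulated color image. The specification does not define the combining operation, so the sketch below assumes, for illustration only, that a border image is a binary mask and that combining means element-wise multiplication with a color image. The function names `combine` and `combine_all` are hypothetical.

```python
from itertools import product


def combine(border, color):
    """Apply a binary border mask to a color image, element by element."""
    return [
        [b * c for b, c in zip(b_row, c_row)]
        for b_row, c_row in zip(border, color)
    ]


def combine_all(borders, colors):
    """Pair every border image with every color image."""
    return [combine(b, c) for b, c in product(borders, colors)]


# Two hypothetical 1 x 3 border masks and three hypothetical 1 x 3 color images.
borders = [[[1, 1, 0]], [[0, 1, 1]]]
colors = [[[0.2, 0.4, 0.6]], [[0.9, 0.8, 0.7]], [[0.5, 0.5, 0.5]]]
second = combine_all(borders, colors)
print(len(second), "second spectrograms")
```

Pairing M border images with N color images yields M x N second spectrograms, which is one way a small number of generated images can expand into a large set of training voices.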

The voice generation unit 140 may unify the second spectrogram 630 and perform an inverse operation on the unified second spectrogram 630 to convert it into sound, thereby generating a plurality of voices 640 for training data.

For example, the voice generation unit 140 may decode the second spectrogram, generated by integrating the outputs of the border simulation unit 132 and the color simulation unit 134, back into a sound waveform and output it as a voice file. Specifically, the voice generation unit 140 may create artificial waves by combining the outputs of the border simulation unit 132 and the color simulation unit 134, combine and restore the artificial waves in parallel to produce a single voice file, and output that voice file.
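The inverse operation from a spectrogram back to a waveform can be illustrated with an inverse short-time Fourier transform and overlap-add. The specification does not name the inverse transform. The sketch below assumes complex-valued frames, that is, frames in which the phase is available; a generated spectrogram image usually carries only magnitude, in which case the phase would first have to be estimated, for example with the Griffin-Lim algorithm. All function names and parameters are hypothetical.

```python
import cmath
import math


def hann(n_len):
    """Periodic Hann window; copies shifted by half the length sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_len) for n in range(n_len)]


def stft(signal, frame_len=16):
    """Complex short-time Fourier transform with 50% overlap."""
    hop = frame_len // 2
    w = hann(frame_len)
    return [
        [
            sum(
                signal[s + n] * w[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            for k in range(frame_len)
        ]
        for s in range(0, len(signal) - frame_len + 1, hop)
    ]


def istft(frames, frame_len=16):
    """Inverse transform: inverse DFT of each frame, then overlap-add."""
    hop = frame_len // 2
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for i, spectrum in enumerate(frames):
        for n in range(frame_len):
            x = sum(
                spectrum[k] * cmath.exp(2j * math.pi * k * n / frame_len)
                for k in range(frame_len)
            ) / frame_len
            out[i * hop + n] += x.real
    return out


# Hypothetical test signal made of two sinusoids.
signal = [
    math.sin(2 * math.pi * 3 * t / 64) + 0.5 * math.cos(2 * math.pi * 5 * t / 64)
    for t in range(64)
]
frames = stft(signal)
rebuilt = istft(frames)
max_error = max(abs(rebuilt[t] - signal[t]) for t in range(8, 56))
print(len(frames), "frames; max interior error", max_error)
```

Because the overlapping windows sum to one, every sample covered by two frames is recovered; only the first and last half-frames, which are covered by a single window, are attenuated.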

FIG. 7 is a flowchart of a method for generating voices for training data according to an embodiment of the present invention. The method shown in FIG. 7 includes steps that are processed in time series according to the embodiments shown in FIGS. 1 to 6. Accordingly, even where details are omitted below, the description given for the embodiments shown in FIGS. 1 to 6 also applies to the method by which the server for generating voices for training data generates voices for training data.

In step S710, the apparatus for generating voices for training data may convert a voice into a first spectrogram.

In step S720, the apparatus for generating voices for training data may use a pre-trained deep learning model to generate a plurality of simulated border images similar to the border of the first spectrogram.

In step S730, the apparatus for generating voices for training data may use a pre-trained deep learning model to generate a plurality of simulated color images similar to the color of the first spectrogram.

In step S740, the apparatus for generating voices for training data may generate a plurality of second spectrograms by combining the plurality of simulated border images and the plurality of simulated color images.

In step S750, the apparatus for generating voices for training data may generate a plurality of voices for training data by converting the plurality of second spectrograms into voices.

In the above description, steps S710 to S750 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.
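The flow of steps S710 to S750 can be sketched end to end as follows. This is an illustration under stated assumptions, not the patented implementation: the pre-trained deep learning models are replaced by simple random perturbations, the border image is assumed to be a binary mask and the color image an intensity map, every simulated border is paired with every simulated color image, and the inversion in S750 is a crude zero-phase transform. All function names and parameter values are hypothetical.

```python
import itertools

import numpy as np

N_FFT = 128  # frame size in samples; hypothetical value


def to_spectrogram(voice):
    """S710: magnitude spectrogram built from non-overlapping frames."""
    frames = voice[: len(voice) // N_FFT * N_FFT].reshape(-1, N_FFT)
    return np.abs(np.fft.rfft(frames, axis=1))


def split_border_and_color(spec):
    """Assumed split: border = binary mask of strong cells, color = intensity map."""
    border = (spec > spec.mean()).astype(float)
    return border, spec


def simulate_borders(border, count, rng, flip_prob=0.02):
    """S720 placeholder for the pre-trained model: randomly flip a few mask cells."""
    return [np.where(rng.random(border.shape) < flip_prob, 1.0 - border, border)
            for _ in range(count)]


def simulate_colors(color, count, rng, scale=0.1):
    """S730 placeholder for the pre-trained model: jitter the intensities."""
    return [np.abs(color * (1.0 + scale * rng.standard_normal(color.shape)))
            for _ in range(count)]


def combine(borders, colors):
    """S740: pair every simulated border with every simulated color image (assumed)."""
    return [b * c for b, c in itertools.product(borders, colors)]


def to_voice(spec):
    """S750: crude zero-phase inversion; frames are simply concatenated."""
    return np.fft.irfft(spec, n=N_FFT, axis=1).reshape(-1)


def generate_training_voices(voice, n_borders=3, n_colors=4, seed=0):
    """Run S710 to S750 and return n_borders * n_colors synthetic waveforms."""
    rng = np.random.default_rng(seed)
    border, color = split_border_and_color(to_spectrogram(voice))
    borders = simulate_borders(border, n_borders, rng)
    colors = simulate_colors(color, n_colors, rng)
    return [to_voice(s) for s in combine(borders, colors)]
```

With the default arguments, one input recording yields twelve synthetic waveforms. That multiplication of a single sample into many is the point of the combination step.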

The method of generating voices for training data in the training-data voice generation server described with reference to FIGS. 1 to 6 may also be implemented as a computer program stored in a medium and executed by a computer, or as a recording medium containing instructions executable by a computer. The method may likewise be implemented in the form of a computer program stored in a medium and executed by a computer.

A computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and non-volatile media as well as removable and non-removable media. A computer-readable medium may also include a computer storage medium. Computer storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The foregoing description of the present invention is given for illustration, and a person of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing the technical idea or essential features of the invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present invention is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: apparatus for generating voices for training data
110: database
120: image conversion unit
130: simulation unit
140: voice generation unit
150: learning unit

Claims

1. An apparatus for generating voices for machine learning training data, the apparatus comprising:
an image conversion unit that converts a voice into a first spectrogram;
a simulation unit that uses a pre-trained deep learning model to generate a plurality of simulated border images similar to a border of the first spectrogram and a plurality of simulated color images similar to a color of the first spectrogram; and
a voice generation unit that combines the plurality of simulated border images and the plurality of simulated color images to generate a plurality of second spectrograms, and converts the plurality of second spectrograms into voices to generate a plurality of voices for training data.

2. The apparatus of claim 1,
wherein the image conversion unit divides the first spectrogram into a border image representing the border of the first spectrogram and a color image representing the color of the first spectrogram.

3. The apparatus of claim 2,
wherein the simulation unit comprises a border simulation unit that generates the plurality of simulated border images by inputting the border image into a first deep learning model.

4. The apparatus of claim 3,
wherein the simulation unit further comprises a color simulation unit that generates the plurality of simulated color images by inputting the color image into a second deep learning model.

5. The apparatus of claim 4,
further comprising a learning unit that trains the first deep learning model to generate the plurality of simulated border images by inputting a plurality of the border images into the first deep learning model, and trains the second deep learning model to generate the plurality of simulated color images by inputting a plurality of the color images into the second deep learning model.

6. The apparatus of claim 1,
wherein the deep learning model is a generative adversarial network (GAN).

7. A method for generating voices for machine learning training data, the method comprising:
an image conversion step of converting a voice into a first spectrogram;
a simulation step of generating, using a pre-trained deep learning model, a plurality of simulated border images similar to a border of the first spectrogram;
a simulation step of generating, using a pre-trained deep learning model, a plurality of simulated color images similar to a color of the first spectrogram;
generating a plurality of second spectrograms by combining the plurality of simulated border images and the plurality of simulated color images; and
converting the plurality of second spectrograms into voices to generate a plurality of voices for training data.

8. The method of claim 7,
wherein, in the image conversion step, the first spectrogram is divided into a border image representing the border of the first spectrogram and a color image representing the color of the first spectrogram.

9. The method of claim 8,
wherein the simulation step comprises a border simulation step of generating the plurality of simulated border images by inputting the border image into a first deep learning model.

10. The method of claim 9,
wherein the simulation step further comprises a color simulation step of generating the plurality of simulated color images by inputting the color image into a second deep learning model.

11. The method of claim 10,
further comprising training the first deep learning model to generate the plurality of simulated border images by inputting a plurality of the border images into the first deep learning model, and training the second deep learning model to generate the plurality of simulated color images by inputting a plurality of the color images into the second deep learning model.

12. The method of claim 7,
wherein the deep learning model is a generative adversarial network (GAN).

13. A computer program stored on a computer-readable medium and comprising a sequence of instructions for generating voices for machine learning training data,
wherein the computer program, when executed by a computing device, comprises instructions that cause the computing device to:
convert a voice into a first spectrogram;
generate, using a pre-trained deep learning model, a plurality of simulated border images similar to a border of the first spectrogram and a plurality of simulated color images similar to a color of the first spectrogram; and
combine the plurality of simulated border images and the plurality of simulated color images to generate a plurality of second spectrograms, and convert the plurality of second spectrograms into voices to generate a plurality of voices for training data.