KR102480360B1

KR102480360B1 - Apparatus, method and computer program for generating synthesized sound source using learning through image

Info

Publication number: KR102480360B1
Application number: KR1020190105131A
Authority: KR
Inventors: 권순구; 최우혁
Original assignee: 주식회사 케이티
Priority date: 2019-08-27
Filing date: 2019-08-27
Publication date: 2022-12-22
Also published as: KR20230006629A; KR20210025295A

Abstract

An apparatus for generating a synthesized sound source using learning through images includes: an input unit that receives input data including a sample sound source in which a user's voice is recorded and text for generating a synthesized sound source; a first image generation unit that converts the input sample sound source into a frequency spectrogram including frequency characteristics of the sample sound source and generates a first image based on the converted frequency spectrogram; a second image inference unit that uses a learning model to infer, from the generated first image, a second image corresponding to the text; and a synthesized sound source generation unit that generates a synthesized sound source based on the inferred second image and the sample sound source.

Description

Apparatus, method and computer program for generating synthesized sound sources using learning through images

The present invention relates to an apparatus, a method, and a computer program for generating a synthesized sound source using learning through images.

Speech synthesis technology refers to technology in which a machine automatically analyzes a human voice and combines it with text for generating a synthesized sound source, thereby artificially producing speech. The voice announcements heard on subways and buses are a representative example of speech synthesis technology.

In connection with such speech synthesis technology, Korean Patent Registration No. 10-1880378, a prior art document, discloses a speech synthesis method and apparatus.

Conventionally, to generate a synthesized sound source based on a user's voice, a voice file in which the user's voice is recorded was converted into a Mel spectrogram, and the magnitude of the Mel spectrogram and the input text were then fed to a machine learning model as training input. A Mel spectrogram is data obtained by analyzing the frequency characteristics of a voice, and it includes pitch information of the voice.

Thereafter, a Mel spectrogram was inferred, and a synthesized sound source file (wav) was generated using a vocoder such as the Griffin-Lim algorithm or WaveNet.

However, because the conventional method trains on the magnitude of the Mel spectrogram and the input text as machine learning inputs, it suffers from overfitting and from low accuracy in inferring the Mel spectrogram.

The present invention aims to provide a synthesized sound source generation apparatus, method, and computer program that convert a sample sound source in which a user's voice is recorded into a first image through a frequency spectrogram and infer, from the converted image, a second image corresponding to text, so that a synthesized sound source can be generated using learning through images.

The present invention also aims to provide a synthesized sound source generation apparatus, method, and computer program that can generate a synthesized sound source based on the inferred second image and the sample sound source.

The present invention further aims to provide a synthesized sound source generation apparatus, method, and computer program that reflect the user's voice characteristics in the synthesized sound source, so that the synthesized sound source is generated based on the user's timbre.

However, the technical problems to be solved by the present embodiments are not limited to those described above, and other technical problems may exist.

As a means for achieving the above technical problems, an embodiment of the present invention may provide an apparatus for generating a synthesized sound source, including: an input unit that receives input data including a sample sound source in which a user's voice is recorded and text for generating a synthesized sound source; a first image generation unit that converts the input sample sound source into a frequency spectrogram including frequency characteristics of the sample sound source and generates a first image based on the converted frequency spectrogram; a second image inference unit that uses a learning model to infer, from the generated first image, a second image corresponding to the text; and a synthesized sound source generation unit that generates a synthesized sound source based on the inferred second image and the sample sound source.

Another embodiment of the present invention may provide a method for generating a synthesized sound source, including: receiving input data including a sample sound source in which a user's voice is recorded and text for generating a synthesized sound source; converting the input sample sound source into a frequency spectrogram including frequency characteristics of the sample sound source; generating a first image based on the converted frequency spectrogram; inferring, using a learning model, a second image corresponding to the text from the generated first image; and generating a synthesized sound source based on the inferred second image and the sample sound source.

Yet another embodiment of the present invention may provide a computer program stored in a medium, the computer program including a sequence of instructions that, when executed by a computing device, cause the computing device to: receive input data including a sample sound source in which a user's voice is recorded and text for generating a synthesized sound source; convert the input sample sound source into a frequency spectrogram including frequency characteristics of the sample sound source; generate a first image based on the converted frequency spectrogram; infer, using a learning model, a second image corresponding to the text from the generated first image; and generate a synthesized sound source based on the inferred second image and the sample sound source.

The above means for solving the problems are merely illustrative and should not be construed as limiting the present invention. In addition to the exemplary embodiments described above, there may be additional embodiments described in the drawings and the detailed description of the invention.

According to any one of the above means for solving the problems, it is possible to provide a synthesized sound source generation apparatus, method, and computer program that convert a sample sound source in which a user's voice is recorded into a first image based on a frequency spectrogram and infer a second image based on the converted first image and the input text, so that a synthesized sound source can be generated through learning through images.

By performing learning through images, it is possible to provide a synthesized sound source generation apparatus, method, and computer program that prevent the overfitting problem that may occur during training and improve accuracy in the image inference process required to generate a synthesized sound source.

By extracting the user's voice characteristics and inserting the extracted characteristics into the synthesized sound source, it is possible to provide a synthesized sound source generation apparatus, method, and computer program that can generate a synthesized sound source based on the user's timbre.

FIG. 1 is a block diagram of a synthesized sound source generation apparatus according to an embodiment of the present invention.
FIGS. 2A and 2B are exemplary diagrams illustrating a frequency spectrogram converted from a sample sound source according to an embodiment of the present invention.
FIG. 3 is a diagram showing a four-dimensional array for extracting RGB data from a first image according to an embodiment of the present invention.
FIG. 4 is a flowchart of a method for generating a synthesized sound source using learning through images in a synthesized sound source generation apparatus according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice it. However, the present invention may be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted in order to explain the present invention clearly, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element interposed between them. In addition, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them, and it should be understood that this does not preclude the existence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In this specification, a "unit" includes a unit realized by hardware, a unit realized by software, and a unit realized using both. In addition, one unit may be realized using two or more pieces of hardware, and two or more units may be realized by one piece of hardware.

In this specification, some of the operations or functions described as being performed by a terminal or device may instead be performed by a server connected to that terminal or device. Likewise, some of the operations or functions described as being performed by a server may be performed by a terminal or device connected to that server.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a synthesized sound source generation apparatus according to an embodiment of the present invention. Referring to FIG. 1, the synthesized sound source generation apparatus 100 may include an input unit 110, a first image generation unit 120, an RGB extraction unit 130, a second image inference unit 140, a timbre feature extraction unit 150, and a synthesized sound source generation unit 160.

The input unit 110 may receive input data including a sample sound source in which a specific voice is recorded and text for generating a synthesized sound source.

The first image generation unit 120 may convert the input sample sound source into a frequency spectrogram including frequency characteristics of the sample sound source. Here, the frequency spectrogram may be a Mel spectrogram. The process of converting the sample sound source into a frequency spectrogram is described in detail with reference to FIGS. 2A and 2B.

FIGS. 2A and 2B are exemplary diagrams illustrating a frequency spectrogram converted from a sample sound source according to an embodiment of the present invention.

The first image generation unit 120 may convert the input sample sound source into time series data and then convert the time series data into a frequency spectrogram. Here, the type of the time series data may be set to float, an approximate-number data type used for floating-point numeric data that has 6 to 7 significant digits and occupies 4 bytes.
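As a rough illustration of the time-series conversion described above, the sketch below turns 16-bit PCM samples into 4-byte floats using only Python's standard library. The 16-bit input format, the scaling by 32768, and the function name `pcm16_to_float32` are assumptions made for illustration; the specification only states that the time series is stored as a 4-byte float.

```python
from array import array


def pcm16_to_float32(samples):
    """Convert signed 16-bit PCM samples to 4-byte floats in [-1.0, 1.0).

    Assumption: the recording is 16-bit PCM; the specification does not say so.
    """
    # Type code 'f' is a C float, which is 4 bytes on common platforms.
    return array("f", (s / 32768.0 for s in samples))


pcm = [0, 16384, -32768, 32767]
series = pcm16_to_float32(pcm)
print(series.itemsize)  # bytes per stored sample
print(list(series))
```

Storing the series in a typed array keeps each sample at single precision, which matches the 6 to 7 significant digits mentioned in the text.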

The first image generation unit 120 may, for example, use the STFT (Short-Time Fourier Transform) to extract the magnitude of each frequency from the time series data and convert the time-magnitude domain into the time-frequency domain, thereby generating an array of the frequency spectrogram.

Here, the STFT is a technique devised to apply the Fourier transform (FT), which converts an input signal into its corresponding spectrum and converts a function in the time domain into a function in the frequency domain, to actually recorded sound.
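The STFT step can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the specification: the Hann window, the naive DFT, the function name `stft_magnitude`, and the toy window and hop sizes are all assumptions chosen so that the example runs quickly without external libraries.

```python
import cmath
import math


def stft_magnitude(signal, win_size, hop_size):
    """Return one magnitude spectrum (bins 0..win_size//2) per frame."""
    # Periodic Hann window (an assumption) to reduce leakage at frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win_size)
              for n in range(win_size)]
    frames = []
    for start in range(0, len(signal) - win_size + 1, hop_size):
        chunk = [signal[start + n] * window[n] for n in range(win_size)]
        spectrum = []
        for k in range(win_size // 2 + 1):
            # Naive DFT of the windowed frame at frequency bin k.
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / win_size)
                      for n in range(win_size))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Toy input: a sine with exactly 8 cycles per 64-sample window.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = stft_magnitude(tone, win_size=64, hop_size=16)
peak_bins = [frame.index(max(frame)) for frame in spec]
print(len(spec), len(spec[0]), peak_bins[0])  # 13 frames, 33 bins, peak at bin 8
```

Each row of `spec` is one time frame and each column one frequency bin, which is the time-frequency layout the text describes. A Mel spectrogram would additionally map these bins onto Mel-scaled channels, a step omitted here.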

The parameters used at this time are, for example, as follows.

[Parameters]

sample_rate=22050Hz, Channel=80

Frame_shift_ms=12.5ms

Frame_length_ms=50ms

Hop_size=275(frame_shift_ms*sample_rate)

Win_size=1100(frame_length_ms*sample_rate)
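The arithmetic behind the hop and window sizes can be checked with a few lines of Python. Converting milliseconds to seconds gives about 275.625 and 1102.5 samples, so the listed values of 275 and 1100 appear to be rounded. The sketch below assumes one plausible reading, namely that the hop is truncated to a whole sample and the window is taken as four hops; the helper name `ms_to_samples` and that reading are assumptions, not statements from the specification.

```python
def ms_to_samples(duration_ms, sample_rate):
    """Number of whole samples covered by a duration given in milliseconds."""
    return int(duration_ms / 1000 * sample_rate)


sample_rate = 22050
hop_exact = 12.5 / 1000 * sample_rate   # about 275.625 samples
win_exact = 50 / 1000 * sample_rate     # about 1102.5 samples
hop_size = ms_to_samples(12.5, sample_rate)
win_size = 4 * hop_size  # assumption: window chosen as exactly four hops
print(hop_exact, hop_size, win_exact, win_size)
```

With these values the frame length is four times the frame shift, so consecutive frames overlap by 75 percent.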

Referring to FIG. 2A, the first image generation unit 120 may generate a two-dimensional array 200 with time on the x-axis and frames on the y-axis, based on the shape data of the Mel spectrogram (for example, structured as (frames, channels)). For example, when the shape of the Mel spectrogram is (2000, 80), the first image generation unit 120 may generate the first image through the two-dimensional array 200.

Referring to FIG. 2B, through this process the first image generation unit 120 may generate a first image represented by a time axis 210 and a frequency axis 220.
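One way to picture the conversion from a (frames, channels) array to an image with a time axis and a frequency axis is sketched below. The min-max scaling to 0..255, the choice to draw the highest channel in the top row, and the function name `spectrogram_to_pixels` are assumptions made for illustration; the specification does not state how values are mapped to pixels.

```python
def spectrogram_to_pixels(spec):
    """Scale a (frames, channels) array to 0..255 and lay it out as an image.

    Rows of the result are frequency channels (highest first) and columns
    are frames, so time runs along x and frequency along y.
    """
    lo = min(min(row) for row in spec)
    hi = max(max(row) for row in spec)
    span = (hi - lo) or 1.0  # avoid dividing by zero for a flat input
    n_frames, n_channels = len(spec), len(spec[0])
    return [[round(255 * (spec[t][c] - lo) / span) for t in range(n_frames)]
            for c in reversed(range(n_channels))]


# Tiny stand-in for a (2000, 80) Mel spectrogram: 3 frames, 2 channels.
mel = [[0.0, 1.0],
       [0.5, 3.0],
       [1.0, 4.0]]
pixels = spectrogram_to_pixels(mel)
print(pixels)  # [[64, 191, 255], [0, 32, 64]]
```

For a (2000, 80) spectrogram the same function would return 80 rows of 2000 pixels each.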

FIG. 3 is a diagram showing a four-dimensional array for extracting RGB data from a first image according to an embodiment of the present invention.

Referring to FIGS. 2B and 3, the RGB extraction unit 130 may extract RGB data from the first image.

For example, the RGB extraction unit 130 may extract, from the two-dimensional array 200, a four-dimensional array 300 with time on the x-axis and frames on the y-axis. Here, each cell of the array may contain R, G, and B values.

The RGB extraction unit 130 may extract the R, G, and B data from the four-dimensional array 300.
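The idea of an array whose cells each hold R, G, and B values, and of pulling those values back out, can be sketched as follows. The colormap in `colorize` is purely hypothetical, as are both function names; the specification does not say how the first image is colored or how the array 300 is laid out in memory.

```python
def colorize(gray):
    """Hypothetical colormap: dark pixels lean blue, bright pixels lean red."""
    return (gray, 255 - abs(2 * gray - 255), 255 - gray)


def extract_rgb(image):
    """Split an image of (R, G, B) cells into separate R, G and B planes."""
    red = [[cell[0] for cell in row] for row in image]
    green = [[cell[1] for cell in row] for row in image]
    blue = [[cell[2] for cell in row] for row in image]
    return red, green, blue


gray_image = [[0, 128, 255],
              [64, 32, 0]]
rgb_image = [[colorize(g) for g in row] for row in gray_image]
red, green, blue = extract_rgb(rgb_image)
print(red)
print(green)
print(blue)
```

Each returned plane has the same rows and columns as the image, so the three planes together carry the same information as the per-cell R, G, and B values.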

The second image inference unit 140 may use a learning model to infer, from the first image, a second image corresponding to the text. Here, the second image may be an image representing a Mel spectrogram.

For example, when the text is "안녕하세요" ("hello"), the second image inference unit 140 may encode "안녕하세요" into numbers such as '5131...' and input them into the learning model together with the first image, thereby inferring the second image. At this time, the second image inference unit 140 may extract the sequential characteristics between phonemes from the numerically encoded text and infer the second image from the first image so that it corresponds to the phonemes of the text.
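The specification does not define how text becomes a digit sequence such as '5131...'. As one possible illustration, the sketch below splits each Hangul syllable into its lead consonant, vowel, and tail consonant using the standard Unicode arithmetic for precomposed syllables, then assigns each a numeric ID. The ID layout and the function names are hypothetical and are not expected to reproduce the digits quoted above.

```python
HANGUL_BASE = 0xAC00          # first precomposed Hangul syllable
NUM_VOWELS, NUM_TAILS = 21, 28


def decompose_syllable(ch):
    """Return (lead, vowel, tail) jamo indices for one Hangul syllable."""
    offset = ord(ch) - HANGUL_BASE
    if not 0 <= offset < 19 * NUM_VOWELS * NUM_TAILS:
        raise ValueError(f"not a precomposed Hangul syllable: {ch!r}")
    lead, rest = divmod(offset, NUM_VOWELS * NUM_TAILS)
    vowel, tail = divmod(rest, NUM_TAILS)
    return lead, vowel, tail


def encode_text(text):
    """Flatten a string of Hangul syllables into a sequence of integer IDs.

    Hypothetical ID layout: leads 0..18, vowels 19..39, tails 40..67.
    """
    ids = []
    for ch in text:
        lead, vowel, tail = decompose_syllable(ch)
        ids.extend([lead, 19 + vowel, 40 + tail])
    return ids


print(decompose_syllable("안"))  # (11, 0, 4)
print(encode_text("안녕"))       # [11, 19, 44, 2, 25, 61]
```

Encoding at the jamo level keeps the phoneme order explicit, which fits the text's remark that the model extracts sequential characteristics between phonemes.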

To this end, the second image inference unit 140 may train the learning model in advance by inputting the first image and various texts (for example, "hello", "nice to meet you", and "I love you") so that the model generates an image corresponding to each text.

For example, the second image inference unit 140 may infer the second image by inputting the extracted RGB data into the learning model.

The second image inference unit 140 may derive an image shape based on the extracted RGB data and the shape of the frequency spectrogram, and may infer the second image by rearranging the derived image shape.
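The rearrangement step can be pictured as reshaping a flat run of RGB values back into an image whose dimensions come from the spectrogram shape. The sketch below assumes the values are stored frame by frame, channel by channel, with R, G, and B adjacent; that ordering and the function name `reshape_rgb` are assumptions, since the specification does not describe the memory layout.

```python
def reshape_rgb(flat_values, frames, channels):
    """Regroup a flat list of R, G, B values into a (frames, channels, 3) image."""
    expected = frames * channels * 3
    if len(flat_values) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat_values)}")
    image = []
    for f in range(frames):
        row = []
        for c in range(channels):
            start = (f * channels + c) * 3
            row.append(tuple(flat_values[start:start + 3]))
        image.append(row)
    return image


flat = list(range(12))  # stand-in data: 2 frames x 2 channels x RGB
image = reshape_rgb(flat, frames=2, channels=2)
print(image)  # [[(0, 1, 2), (3, 4, 5)], [(6, 7, 8), (9, 10, 11)]]
```

For the (2000, 80) example given earlier, the same call would use `frames=2000` and `channels=80`.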

Returning to FIG. 1, the timbre feature extraction unit 150 may extract characteristics of the user's voice from the input sample sound source. Here, the characteristics of the user's voice may include pitch, an aperiodic spectrum, a harmonic spectrum, and the like. These characteristics are extracted so that they can be inserted into the second image, because the second image carries no frequency information about the user's voice when the synthesized sound source is generated from it.

For example, the timbre feature extraction unit 150 may extract the pitch (F0), which is the fundamental frequency of a periodic waveform (in speech signals, the pitch generally lies between 40 and 400 Hz), using the ACF (Autocorrelation Function), the AMDF (Average Magnitude Difference Function), the cepstrum, or the like.
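Of the methods listed above, autocorrelation is the simplest to sketch. The example below searches only the lags that correspond to the 40 to 400 Hz range mentioned in the text and returns the frequency of the strongest one. The function name `estimate_pitch_acf`, the unnormalized correlation, and the synthetic 100 Hz test tone are assumptions for illustration.

```python
import math


def estimate_pitch_acf(signal, sample_rate, f_min=40.0, f_max=400.0):
    """Estimate F0 in Hz by picking the autocorrelation peak in [f_min, f_max]."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself delayed by `lag` samples.
        score = sum(signal[n] * signal[n + lag]
                    for n in range(len(signal) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


sample_rate = 8000
tone = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(800)]
f0 = estimate_pitch_acf(tone, sample_rate)
print(f0)  # 100.0
```

Real speech would normally be analyzed frame by frame, with voiced and unvoiced frames handled separately; this sketch analyzes a single steady tone.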

As another example, the timbre feature extraction unit 150 may perform inverse filtering on the sample sound source, extract residual frames and compute their discrete Fourier transform (DFT), extract the periodic spectrum, and then extract the aperiodic spectrum based on a periodic frame synthesized on a zero-phase basis and a periodic excitation signal generated based on PSOLA (Pitch Synchronous Overlap and Add).

As yet another example, the timbre feature extraction unit 150 may extract the harmonic spectrum using the formula H(k) = k*F0.
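Reading k as the harmonic index (an assumption, since the specification does not define it), the formula gives the frequencies at which harmonic energy is expected. The sketch below lists those frequencies up to the Nyquist limit; the cutoff at half the sample rate and the function name `harmonic_frequencies` are additions made for illustration.

```python
def harmonic_frequencies(f0, sample_rate):
    """List H(k) = k * F0 for every harmonic below the Nyquist frequency."""
    nyquist = sample_rate / 2
    harmonics = []
    k = 1
    while k * f0 < nyquist:
        harmonics.append(k * f0)
        k += 1
    return harmonics


print(harmonic_frequencies(100.0, 1000))  # [100.0, 200.0, 300.0, 400.0]
```

Sampling a magnitude spectrum at these frequencies would then yield one value per harmonic.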

The synthesized sound source generation unit 160 may generate a synthesized sound source based on the inferred second image and the sample sound source. At this time, the synthesized sound source generation unit 160 may generate the synthesized sound source by additionally reflecting the extracted characteristics of the user's voice.

Through this process, a synthesized sound source can be generated in which the user's timbre is expressed with high quality.

The synthesized sound source generation apparatus 100 may be operated by a computer program stored in a medium, the program including a sequence of instructions for generating a synthesized sound source using learning through images. When executed by a computing device, the computer program may include a sequence of instructions that cause the computing device to: receive input data including a sample sound source in which a user's voice is recorded and text for generating a synthesized sound source; convert the input sample sound source into a frequency spectrogram including frequency characteristics of the sample sound source; generate a first image based on the converted frequency spectrogram; infer, using a learning model, a second image corresponding to the text from the generated first image; and generate a synthesized sound source based on the inferred second image and the sample sound source.

FIG. 4 is a flowchart of a method for generating a synthesized sound source using learning through images in a synthesized sound source generation apparatus according to an embodiment of the present invention. The method shown in FIG. 4 includes steps that are processed in time series by the synthesized sound source generation apparatus 100 according to the embodiments shown in FIGS. 1 to 3. Therefore, even where omitted below, the description given above of the synthesized sound source generation apparatus 100 according to the embodiments shown in FIGS. 1 to 3 also applies to the method of FIG. 4.

In step S410, the synthesized sound source generation apparatus 100 may receive input data including a sample sound source in which a user's voice is recorded and text for generating a synthesized sound source.

In step S420, the synthesized sound source generation apparatus 100 may convert the input sample sound source into a frequency spectrogram including frequency characteristics of the sample sound source.

In step S430, the synthesized sound source generation apparatus 100 may generate a first image based on the converted frequency spectrogram.

In step S440, the synthesized sound source generation apparatus 100 may use a learning model to infer, from the generated first image, a second image corresponding to the text.

In step S450, the synthesized sound source generation apparatus 100 may generate a synthesized sound source based on the inferred second image and the sample sound source.

In the above description, steps S410 to S450 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. In addition, some steps may be omitted as needed, and the order of the steps may be changed.
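The order of steps S410 to S450 can be summarized in a small structural sketch. The class name `SynthesisPipeline` and its four callable fields are hypothetical; every stage is supplied by the caller, so the code shows only how the steps are chained and makes no claim about how any individual stage is implemented.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class SynthesisPipeline:
    """Wires steps S420 to S450 together; every stage is supplied by the caller."""
    to_spectrogram: Callable   # S420: sample sound source -> spectrogram
    to_image: Callable         # S430: spectrogram -> first image
    infer_image: Callable      # S440: (first image, text) -> second image
    to_audio: Callable         # S450: (second image, sample) -> synthesized audio

    def run(self, sample: Sequence[float], text: str):
        # S410: `sample` and `text` are the received input data.
        spectrogram = self.to_spectrogram(sample)
        first_image = self.to_image(spectrogram)
        second_image = self.infer_image(first_image, text)
        return self.to_audio(second_image, sample)


# Trivial stand-ins that only record the order of the calls.
pipeline = SynthesisPipeline(
    to_spectrogram=lambda s: ("spec", len(s)),
    to_image=lambda spec: ("img1", spec),
    infer_image=lambda img, text: ("img2", img, text),
    to_audio=lambda img, s: ("audio", img),
)
result = pipeline.run([0.0, 0.1, 0.2], "hello")
print(result)
```

Because each stage is injected, steps can be split, merged, or replaced without changing `run`, in line with the flexibility the paragraph above allows.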

The method for generating a synthesized sound source using learning through images in the synthesized sound source generation apparatus described with reference to FIGS. 1 to 4 may also be implemented in the form of a computer program stored in a medium and executed by a computer, or in the form of a recording medium including instructions executable by a computer. In addition, the method described with reference to FIGS. 1 to 4 may be implemented in the form of a computer program stored in a medium and executed by a computer.

A computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and nonvolatile media as well as removable and non-removable media. A computer-readable medium may also include a computer storage medium. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The above description of the present invention is for illustrative purposes, and those of ordinary skill in the art to which the present invention pertains will understand that it can be readily modified into other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present invention is indicated by the claims that follow rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: synthesized sound source generation apparatus
110: input unit
120: first image generation unit
130: RGB extraction unit
140: second image inference unit
150: timbre feature extraction unit
160: synthesized sound source generation unit

Claims

An apparatus for generating a synthesized sound source using learning through images, the apparatus comprising:
an input unit that receives input data including a sample sound source in which a user's voice is recorded and text for generating a synthesized sound source;
a first image generating unit that converts the input sample sound source into a frequency spectrogram including frequency characteristics of the sample sound source, and generates a first image based on the converted frequency spectrogram;
a second image inference unit that infers a second image corresponding to the text from the generated first image using a learning model;
a synthesized sound source generating unit that generates a synthesized sound source based on the inferred second image and the sample sound source; and
an RGB extraction unit that extracts RGB data from the first image,
wherein information on the text and the first image are input to the learning model and the second image is inferred from the learning model, and
wherein the second image inference unit infers the second image by inputting the extracted RGB data to the learning model.
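
The first image generation and RGB extraction recited above can be illustrated with a minimal sketch. The claims do not specify a window length, hop size, or color mapping, so `frame_len`, `hop`, `floor_db`, and the three-channel mapping in `spectrogram_to_image` are assumptions chosen only for illustration, and all function names are hypothetical. The learning model itself is not sketched.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft yields frame_len // 2 + 1 frequency bins per frame
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (freq_bins, n_frames)

def spectrogram_to_image(spec, floor_db=-80.0):
    """Map a magnitude spectrogram to an 8-bit RGB image (assumed color mapping)."""
    peak = max(float(spec.max()), 1e-10)
    db = 20.0 * np.log10(np.maximum(spec, 1e-10) / peak)     # 0 dB at the peak
    level = np.clip((db - floor_db) / -floor_db, 0.0, 1.0)   # 0 = quiet, 1 = loud
    rgb = np.stack([level, level ** 2, 1.0 - level], axis=-1)
    return (rgb * 255).round().astype(np.uint8)              # (freq_bins, n_frames, 3)

def extract_rgb(image):
    """Flatten an RGB image into an (n_pixels, 3) array of RGB values."""
    return image.reshape(-1, 3)

# Stand-in for a recorded sample sound source: one second of a 440 Hz tone.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
sample = np.sin(2 * np.pi * 440.0 * t)

first_image = spectrogram_to_image(spectrogram(sample))
rgb_data = extract_rgb(first_image)
```

In this sketch `rgb_data` stands for the data that would be passed to the learning model together with the text information.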

The apparatus of claim 1,
further comprising a timbre feature extraction unit that extracts features of the user's voice from the input sample sound source,
wherein the synthesized sound source generating unit generates the synthesized sound source by further reflecting the extracted features of the user's voice.

The apparatus of claim 2,
wherein the features of the user's voice include at least one of a pitch, an aperiodic spectrum, and a harmonic spectrum.
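
Of the three voice features named above, pitch is the simplest to illustrate. The claims do not state how the pitch is obtained, so the autocorrelation approach below is a swapped-in, commonly used technique rather than the method of the invention. The function name `estimate_pitch` and the search range `fmin`/`fmax` are assumptions. The aperiodic and harmonic spectra are not covered by this sketch.

```python
import numpy as np

def estimate_pitch(frame, sample_rate, fmin=60.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) of one voiced frame by autocorrelation."""
    frame = frame - frame.mean()
    # Full autocorrelation, keeping only the non-negative lags
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / fmax)                     # shortest period searched
    lag_max = min(int(sample_rate / fmin), len(ac) - 1)   # longest period searched
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return sample_rate / lag

# Stand-in for a voiced frame: a 200 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(2048) / sample_rate
frame = np.sin(2 * np.pi * 200.0 * t)
pitch_hz = estimate_pitch(frame, sample_rate)
```

The estimate is quantized to integer lags, so it should be read as approximate.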

The apparatus of claim 1,
wherein the frequency spectrogram is a Mel spectrogram.
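
A Mel spectrogram is commonly obtained by applying a bank of triangular filters, spaced evenly on the mel scale, to a power spectrogram. The sketch below uses the widely used conversion m = 2595 log10(1 + f / 700). The number of filters `n_mels` and all function names are assumptions, since the claims give no parameters.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale; shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)    # slope up to the center frequency
        falling = (hi - fft_freqs) / (hi - mid)   # slope down from the center frequency
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def mel_spectrogram(power_spec, sample_rate, n_mels=40):
    """Project a (n_fft // 2 + 1, n_frames) power spectrogram onto n_mels mel bands."""
    n_fft = 2 * (power_spec.shape[0] - 1)
    return mel_filterbank(n_mels, n_fft, sample_rate) @ power_spec

# Placeholder power spectrogram: 129 frequency bins, 10 frames.
power_spec = np.ones((129, 10))
mel = mel_spectrogram(power_spec, sample_rate=16000)
```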

(deleted)

The apparatus of claim 1,
wherein the second image inference unit derives an image shape based on the extracted RGB data and the shape of the frequency spectrogram, and infers the second image by rearranging the derived image shape.
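
One possible reading of the rearrangement recited above is that flat RGB data is reshaped back into an image whose height and width follow the shape of the frequency spectrogram. The sketch below illustrates only that reading. The function name `rearrange_to_image` and the (n_pixels, 3) data layout are assumptions, and the claims do not confirm that this is how the shape is derived.

```python
import numpy as np

def rearrange_to_image(rgb_data, spectrogram_shape):
    """Reshape flat (n_pixels, 3) RGB data into a (freq_bins, n_frames, 3) image."""
    freq_bins, n_frames = spectrogram_shape
    rgb_data = np.asarray(rgb_data)
    if rgb_data.shape != (freq_bins * n_frames, 3):
        raise ValueError("RGB data does not match the spectrogram shape")
    return rgb_data.reshape(freq_bins, n_frames, 3)

# Placeholder data: a 4 x 5 image flattened to 20 RGB triples and restored.
original = np.arange(4 * 5 * 3, dtype=np.uint8).reshape(4, 5, 3)
flat = original.reshape(-1, 3)
restored = rearrange_to_image(flat, (4, 5))
```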

The apparatus of claim 1,
wherein the second image inference unit trains the learning model to generate a plurality of images by inputting the first image and a plurality of texts.

A method for generating a synthesized sound source using learning through images in a synthesized sound source generating device, the method comprising:
receiving input data including a sample sound source in which a user's voice is recorded and text for generating a synthesized sound source;
converting the input sample sound source into a frequency spectrogram including frequency characteristics of the sample sound source;
generating a first image based on the converted frequency spectrogram;
inferring a second image corresponding to the input text from the generated first image using a learning model; and
generating a synthesized sound source based on the inferred second image and the sample sound source,
wherein information on the text and the first image are input to the learning model and the second image is inferred from the learning model,
the method further comprising extracting RGB data from the first image; and
inferring the second image by inputting the extracted RGB data to the learning model.

The method of claim 8, further comprising:
extracting features of the user's voice from the input sample sound source; and
generating the synthesized sound source by further reflecting the extracted features of the user's voice.

The method of claim 9,
wherein the features of the user's voice include at least one of a pitch, an aperiodic spectrum, and a harmonic spectrum.

The method of claim 8,
wherein the frequency spectrogram is a Mel spectrogram.

(deleted)

The method of claim 11, further comprising:
deriving an image shape based on the extracted RGB data and the shape of the frequency spectrogram; and
inferring the second image by rearranging the derived image shape.

The method of claim 8,
further comprising training the learning model to generate a plurality of images by inputting the first image and a plurality of texts.

A computer program stored in a medium, the computer program including a sequence of instructions for generating a synthesized sound source using learning through images,
wherein the computer program, when executed by a computing device, includes instructions for:
receiving input data including a sample sound source in which a user's voice is recorded and text for generating a synthesized sound source;
converting the input sample sound source into a frequency spectrogram including frequency characteristics of the sample sound source, and generating a first image based on the converted frequency spectrogram;
inferring a second image corresponding to the input text from the generated first image using a learning model;
generating a synthesized sound source based on the inferred second image and the sample sound source;
inferring the second image from the learning model by inputting information on the text and the first image to the learning model; and
extracting RGB data from the first image,
wherein the instruction for inferring the second image includes
an instruction for inferring the second image by inputting the extracted RGB data to the learning model.