KR102035448B1

KR102035448B1 - Voice instrument

Info

Publication number: KR102035448B1
Application number: KR1020190014768A
Authority: KR
Inventors: 조성구
Original assignee: 세명대학교 산학협력단; (주)빛과 수학
Priority date: 2019-02-08
Filing date: 2019-02-08
Publication date: 2019-11-15

Abstract

An embodiment of the present invention relates to a voice instrument. The voice instrument comprises: an input unit that includes a skin vibration sensor and receives, from a user, source music information including lyric information, pitch information, and volume information; a storage unit that stores music-related data including target music information, which includes lyric information and pitch information; a control unit that obtains from the storage unit the target music information whose lyric information and pitch information correspond to those of the source music information, and amplifies the obtained target music information according to the volume information of the source music information to obtain amplified target music information; and an output unit that outputs the amplified target music information. The target music information includes tone (timbre) information different from that of the source music information.

Description

Voice instrument

The present invention relates to a voice instrument.

An ordinary instrument is built to produce sounds of varying timbre, depending on its structure and the way it is sounded. As far as singing is concerned, the human oral cavity can itself be regarded as an instrument.

Regarding such a voice instrument, US 4,236,434 (Dec. 2, 1980) attempted to produce sound by implementing in hardware components such as formant filters that mimic the human voice. However, hardware formant filters were neither as varied nor as refined as the human voice.

Meanwhile, US 2009/0089063 A1 (Apr. 2, 2009) proposed a voice conversion method that transforms the frequencies of a source voice and substitutes them in part with the frequencies of a target voice. With this method, however, the source and target voices are mixed together, so it is difficult to sing in the target voice alone.

In addition, US 10008193 B1 (Jun. 26, 2018) proposes a method that, in order to have a source voice sing in a target voice, obtains the average pitch of the two singers' performances and synthesizes the rhythm of the source voice with the pitch of the target voice. However, this method also mixes the source and target voices, and it is of little help to people who cannot sing well.

Furthermore, when singing is captured with a microphone in the conventional way and the surroundings are loud, the noise is too strong for the intended purpose to be achieved.

(Patent Document 1) US4236434 B1

(Patent Document 2) US20090089063 A1

(Patent Document 3) US10008193 B1

Some people sing well, but many do not. People who cannot sing well often envy good singers and wish they could sing as well.

The present invention is intended to meet this need and relates to a voice instrument that lets a person produce singing in, for example, the voice of a desired singer, even if that person cannot sing well.

In addition, just as conventional instruments such as the guitar, violin, piano, or drum can be played in an environment containing other singing or noise, the present invention allows the voice instrument to be played in such an environment as well.

A voice instrument according to the present embodiment includes: an input unit that includes a skin vibration sensor and receives, from a user, source song sound information including lyrics information, pitch information, and volume information; a storage unit that stores song-sound-related data including target song sound information, which includes lyrics information and pitch information; a control unit that obtains from the storage unit the target song sound information whose lyrics information and pitch information correspond to the lyrics information and pitch information of the source song sound information, and amplifies the obtained target song sound information according to the volume information of the source song sound information to obtain amplified target song sound information; and an output unit that outputs the amplified target song sound information. The target song sound information may include timbre information different from that of the source song sound information.

The input unit may further include a first input device that receives the pitch information of the source song sound information.

The input unit may further include a second input device that receives the volume information of the source song sound information.

The song-sound-related data stored in the storage unit may include a target song sound information table that holds, for each of a plurality of pieces of lyrics information that can be uttered within a unit time, target song sound information at different pitches.

The unit time of the target song sound information stored in the storage unit may be longer than the unit time of the source song sound information.

The storage unit may store a plurality of feature vectors or artificial intelligence (AI) variable vectors for classifying a plurality of pieces of lyrics information that can be uttered within a unit time.

The storage unit may contain two or more groups of target song sound information.

The voice instrument may further include an input device and an output device for inputting, outputting, selecting, storing, changing, and/or deleting the source song sound information and/or the song-sound-related data.

The voice instrument may further include a communication device for downloading and/or uploading the source song sound information and/or the song-sound-related data.

The control unit may classify the lyrics information from the source song sound information received from the input unit by using an AI variable vector trained with a convolutional neural network (CNN) or a generative adversarial network (GAN).

The target song sound information may be generated using a generator-network AI variable vector trained with a generative adversarial network.

The target song sound information may include rap, pansori, Peking opera singing, accent, and/or intonation data.

A voice processing method according to the present embodiment includes: receiving, from a user through an input unit including a skin vibration sensor, source song sound information including lyrics information, pitch information, and volume information; obtaining from a storage unit target song sound information whose lyrics information and pitch information correspond to the lyrics information and pitch information of the source song sound information; amplifying the obtained target song sound information according to the volume information of the source song sound information to obtain amplified target song sound information; and outputting the amplified target song sound information as a song sound. The target song sound information may include timbre information different from that of the source song sound information.

A computer program according to the present embodiment may be stored in a recording medium in order to implement the voice processing method according to the present embodiment.

The human oral cavity is also a kind of musical instrument. With the present invention, even a person who cannot sing well can sing in a desired voice, as if playing an instrument.

Furthermore, not only solos but also duets, trios, and the like can be sung.

In addition, the source voice can be accurately distinguished regardless of ambient sounds such as other instruments, other people's singing, or noise.

FIG. 1 is a block diagram of a voice instrument according to a first embodiment of the present invention.
FIG. 2 is a target song sound information table stored in a storage unit of the voice instrument according to the first embodiment of the present invention.
FIG. 3 is a flowchart showing the function of a control unit that uses feature vectors in the voice instrument according to the first embodiment of the present invention.
FIG. 4 is a source song sound information table made up of feature vectors of the voice instrument according to the first embodiment of the present invention.
FIG. 5 is a flowchart showing the function of a control unit to which an artificial intelligence algorithm is applied in a voice instrument according to a modification of the first embodiment of the present invention.
FIG. 6A (a) is a flowchart showing an artificial intelligence learning method using a convolutional neural network according to the modification of the first embodiment of the present invention, and FIG. 6A (b) is a flowchart showing the transformation process through the multiple layers of the convolutional neural network.
FIG. 6B is a flowchart showing an artificial intelligence learning method using a generative adversarial network according to the modification of the first embodiment of the present invention.
FIG. 7 is a block diagram of a voice instrument according to a second embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

However, the technical idea of the present invention is not limited to the embodiments described and may be implemented in various different forms; within the scope of that technical idea, one or more components of the embodiments may be selectively combined or substituted.

Unless explicitly and specifically defined otherwise, the terms used in the embodiments of the present invention (including technical and scientific terms) are to be interpreted with the meanings generally understood by a person of ordinary skill in the art to which the present invention pertains, and commonly used terms, such as those defined in dictionaries, are to be interpreted in light of their contextual meaning in the related art.

The terms used in the embodiments of the present invention are intended to describe the embodiments and are not intended to limit the present invention.

In this specification, a singular form may also include the plural unless the text specifically states otherwise, and the phrase "at least one (or one or more) of A, B, and C" may include one or more of all the combinations that can be formed from A, B, and C.

In describing the components of the embodiments of the present invention, terms such as first, second, A, B, (a), and (b) may be used. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components.

When a component is described as being 'connected', 'coupled', or 'joined' to another component, this covers not only the case in which the component is directly connected, coupled, or joined to the other component, but also the case in which it is connected, coupled, or joined by way of a further component located between the two.

The configuration of the voice instrument according to the first embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a block diagram of the voice instrument according to the first embodiment of the present invention, and FIG. 2 is a target song sound information table stored in the storage unit of the voice instrument according to the first embodiment of the present invention.

Song sound information consists of the lyrics information of a song, the pitch information of the song sound, and the volume information of the song sound. The input unit 100 of the present invention receives the source song sound information entered by the user. Here, the source song sound information may be captured not by a microphone that senses vibrations of the air but by a skin vibration sensor 110 that touches the cheek, the area around the neck, or a similar location and senses vibrations of the skin. A microphone that relies on air vibration alone is vulnerable to ambient noise, and even if artificial intelligence techniques are applied to remove the noise, the noise-removal computation imposes a heavy load. Moreover, if two people sing at the same distance from the microphone, it is ambiguous whose voice should be treated as the source voice and whose as noise. To eliminate this load and ambiguity at the source, the voice instrument 10 of the present invention may therefore use the skin vibration sensor 110 to input song sound information. Noise and the voices of people other than the wearer have almost no effect on the skin vibration sensor 110. To acquire and analyze the song sound information more accurately, the input unit 100 may include two or more skin vibration sensors 110. One example of the skin vibration sensor 110 is a crack sensor modeled on the sensory organ of a spider. The skin vibration sensor 110 may be fabricated by depositing a stiff material on a viscoelastic polymer and then bending the sensor to create artificial cracks. For example, the viscoelastic material may be PUA (polyurethane acrylate), and the stiff material may be platinum.

The storage unit 200 stores song-sound-related data, including the target song sound information. The song-sound-related data may also include the source song sound information, features of the source song sound, variables of the algorithm that analyzes the source song sound, artificial intelligence variables, and song sounds played with the voice instrument 10. The target song sound information may have a timbre different from that of the source song sound information. The timbre of the target song sound information may be chosen to be that of a person the user wants, for example a singer. To acquire source or target song sound information, a unit time may first be determined. A unit time of roughly 20 ms to 50 ms is usually preferable, but the unit time is not limited to this range. The target song sound information of the present invention consists of song sound information at different pitches for each of a plurality of pieces of lyrics information that can be uttered within the given unit time. The unit time used to acquire the target song sound information need not equal the unit time used to acquire the source song sound information. Rather, the unit time for the target song sound information is preferably longer than that for the source song sound information.
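As a rough illustration of the unit time described above, the following Python sketch splits a sampled signal into consecutive unit-time frames. The helper name `split_into_frames`, the 16 kHz sampling rate, and the specific 25 ms and 50 ms frame lengths are assumptions made for this example; the text itself only states a preferred range of about 20 ms to 50 ms and that the target unit time may be longer than the source unit time.

```python
def split_into_frames(samples, sample_rate, unit_ms):
    """Split samples into consecutive, non-overlapping frames of unit_ms milliseconds.

    A trailing partial frame is dropped. Hypothetical helper for illustration only.
    """
    frame_len = int(sample_rate * unit_ms / 1000)
    if frame_len <= 0:
        raise ValueError("unit time too short for this sample rate")
    count = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(count)]


# One second of silence at an assumed 16 kHz sampling rate.
signal = [0.0] * 16000
source_frames = split_into_frames(signal, 16000, unit_ms=25)  # shorter source unit time
target_frames = split_into_frames(signal, 16000, unit_ms=50)  # longer target unit time
print(len(source_frames), len(source_frames[0]))
print(len(target_frames), len(target_frames[0]))
```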

Accordingly, the target song sound information stored in the storage unit 200 may be organized as a table. FIG. 2 shows an example of such a target song sound information table; for convenience, it lists N pieces of song sound information for each note of the scale from the 1st to the 8th octave. For example, every entry in row (1) is song sound information [a], that is, lyrics information /a/ sung at each note of each octave. The [a] at note G and the [a] at note A correspond to the same lyrics information /a/ but have different pitches; they are nevertheless written with the same symbol [a] to avoid unnecessary notational complexity. Likewise, every entry in row (2) is song sound information [i], that is, lyrics information /i/ sung at each note of each octave. Of course, it may be difficult for a person (singer) to utter /a/, /i/, and so on at every pitch from the 1st to the 8th octave, so not every cell of the table in FIG. 2 needs to be filled with uttered song sound information. If the song sound information for a cell is missing, a nearby entry can be substituted.
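The table lookup with substitution of a nearby entry can be sketched as follows. This is only an illustrative data structure, not the patent's storage format: the dictionary layout, the use of MIDI note numbers as pitch keys, the placeholder strings standing in for stored audio, and the tie-breaking rule (prefer the lower note) are all assumptions.

```python
# Hypothetical table: lyric symbol -> {MIDI note number: stored target sound}.
# Strings stand in for the stored audio of each cell.
target_table = {
    "a": {60: "[a]@C4", 62: "[a]@D4", 67: "[a]@G4"},
    "i": {60: "[i]@C4", 64: "[i]@E4"},
}


def lookup_target(table, lyric, note):
    """Return (note_used, sound) for the stored cell closest in pitch to `note`.

    Ties are broken in favor of the lower note (an arbitrary choice for this sketch).
    """
    row = table[lyric]
    nearest = min(row, key=lambda stored: (abs(stored - note), stored))
    return nearest, row[nearest]


print(lookup_target(target_table, "a", 62))  # the exact cell exists
print(lookup_target(target_table, "a", 65))  # empty cell: the nearest stored note is used
```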

The lyrics information may vary with the length of the unit time, but it can usually be distinctive features, phonemes, fragments, or graphemes. The IPA (International Phonetic Alphabet) has 107 sound symbols, 52 accents, and 4 intonations, and the target song sound information table may be built on this basis.

By adjusting the pitch information and the volume information, the voice instrument 10 according to the first embodiment of the present invention can produce vocal styles such as rap, pansori, and Peking opera. It can also speak in another person's voice using various vocalization styles with accent, intonation, and the like. In this case, the target song sound information may include rap, pansori, Peking opera singing, accent, and/or intonation data.

The output unit 300, which outputs the target song sound information, may directly include a speaker, but is not limited to this. For example, it may be connected to another instrument, such as a synthesizer, to output the target song sound information.

The control unit 400 may process the source song sound information received through the input unit 100 and transmit the result to the output unit 300.

A voice processing method using the voice instrument according to the present invention is described below.

The voice processing method may include: receiving, from a user through the input unit 100 including the skin vibration sensor 110, source song sound information including lyrics information, pitch information, and volume information; obtaining from the storage unit 200 target song sound information whose lyrics information and pitch information correspond to the lyrics information and pitch information of the source song sound information; amplifying the obtained target song sound information according to the volume information of the source song sound information to obtain amplified target song sound information; and outputting the amplified target song sound information as a song sound. The target song sound information may include timbre information different from that of the source song sound information.

The voice processing method performed by the control unit of the voice instrument according to the first embodiment of the present invention is described below with reference to the drawings.

FIG. 3 is a flowchart showing the function of the control unit that uses feature vectors in the voice instrument according to the first embodiment of the present invention, and FIG. 4 is a source song sound information table made up of feature vectors of the voice instrument according to the first embodiment of the present invention.

Referring to FIG. 3, the voice processing method 1000 using feature vectors may include: receiving source song sound information from the input unit 100 (1001); analyzing the received source song sound information to classify the lyrics information, pitch information, and volume information of the song sound over time (1002); matching and obtaining, from the target song sound information stored in the storage unit 200, the target song sound information whose lyrics information and pitch information correspond to those of the source song sound information (1003); amplifying the obtained target song sound information according to the volume information of the source song sound information to obtain amplified target song sound information (1004); and transmitting the amplified target song sound information to the output unit 300 for output (1005).

Step 1002, in which the received source song sound information is analyzed to classify the lyrics information, pitch information, and volume information of the song sound over time, may include a preprocessing step (1002a) of converting the analog source song sound information received through the skin vibration sensor 110 into digital form for analysis, a step (1002b) of extracting feature data from the preprocessed source song sound information, and a step (1002c) of classifying the extracted feature data using a source table.

As mentioned above, the main song sound information of the present invention comprises lyrics information, pitch information, and volume information. Because a song sound is physically a wave, its volume information corresponds to the amplitude of the wave, whereas its lyrics information and pitch information are closely related to the frequency of the wave.
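Since the volume corresponds to the amplitude of the wave, the amplification step (1004) can be pictured as a simple gain applied to the matched target frame. The sketch below assumes that the root-mean-square (RMS) amplitude of a frame is used as the volume measure and that the target is scaled to the same RMS as the source; the function names and sample values are hypothetical, and the patent does not prescribe this particular rule.

```python
import math


def rms(frame):
    """Root-mean-square amplitude of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def amplify_to_match(target_frame, source_frame):
    """Scale target_frame so that its RMS equals the RMS of source_frame."""
    target_level = rms(target_frame)
    if target_level == 0.0:
        return list(target_frame)  # a silent target cannot be scaled
    gain = rms(source_frame) / target_level
    return [gain * x for x in target_frame]


source = [0.2, -0.2, 0.2, -0.2]  # quiet input from the user (invented values)
target = [0.5, -0.5, 0.5, -0.5]  # stored target sound at a different level
scaled = amplify_to_match(target, source)
print(round(rms(scaled), 6))
```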

The lyrics information and pitch information of the input source song sound information can be analyzed and classified using the spectrum obtained by a fast Fourier transform (FFT). To analyze and classify the song sound, feature data can be extracted from the spectrum using measures such as energy, RMS (root mean square), SC (spectral centroid), MFCC (mel-frequency cepstral coefficients), and the cepstrum, after which the lyrics information and pitch information are obtained.

As one example, the MFCC method divides the song sound into short segments of the given unit time and analyzes the spectrum of each segment to extract features of the song sound. More specifically, the power spectrum over the unit time is analyzed to compute the power on the so-called mel scale, the logarithm of that power is taken, a DCT (discrete cosine transform) is applied to the result, and the resulting coefficients (for example, coefficients 2 to 13) are used as feature data for the lyrics information of the source song sound information.
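The MFCC steps just described (power spectrum, mel-scale band powers, logarithm, DCT, coefficients 2 to 13) can be sketched in plain Python as below. This is a minimal sketch under several assumptions that are not in the text: a direct DFT stands in for an FFT to keep the example dependency-free, the mel mapping 2595·log10(1 + f/700) is one commonly used convention, 20 triangular filters and an 8 kHz sampling rate are arbitrary choices, "coefficients 2 to 13" is read as 1-based numbering (zero-based indices 1 to 12), and the input is random noise rather than a recorded voice.

```python
import math
import random


def power_spectrum(frame):
    """Power spectrum of a real frame via a direct (slow) DFT, bins 0..N/2."""
    n = len(frame)
    powers = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        powers.append((re * re + im * im) / n)
    return powers


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(num_filters, frame_len, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(top * j / (num_filters + 1)) for j in range(num_filters + 2)]
    bins = [int((frame_len + 1) * hz / sample_rate) for hz in edges]
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        weights = [0.0] * (frame_len // 2 + 1)
        for k in range(lo, mid):      # rising edge of the triangle
            weights[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):      # falling edge, peak of 1.0 at bin `mid`
            weights[k] = (hi - k) / (hi - mid)
        bank.append(weights)
    return bank


def mfcc_2_to_13(frame, sample_rate, num_filters=20):
    """Return DCT coefficients 2..13 (1-based) of the log mel-band powers."""
    powers = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # A small constant keeps the logarithm finite for empty bands.
    log_energy = [math.log(sum(w * p for w, p in zip(weights, powers)) + 1e-10)
                  for weights in bank]
    coeffs = [sum(e * math.cos(math.pi * c * (j + 0.5) / num_filters)
                  for j, e in enumerate(log_energy))
              for c in range(num_filters)]
    return coeffs[1:13]


rng = random.Random(0)
frame = [rng.uniform(-1.0, 1.0) for _ in range(320)]  # 40 ms at an assumed 8 kHz
features = mfcc_2_to_13(frame, 8000)
print(len(features))
```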

The pitch information of the source song sound information is obtained by finding the fundamental tone of the song sound, for example with the cepstrum. The cepstrum is obtained by taking the logarithm of the spectrum of the song sound and then applying an inverse Fourier transform, and it can be used to classify the pitch information of the song sound, among other things. The invention is not, however, limited to this method.
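The cepstrum-based pitch search can be sketched as follows: take the logarithm of the magnitude spectrum, apply an inverse transform, and pick the quefrency (period in samples) with the largest value inside a plausible range. Everything concrete here is an assumption made for illustration: the function names, the direct DFT in place of an FFT, the 120 Hz to 500 Hz search range (kept narrow so that only the first cepstral peak of the test tone falls inside it), the missing analysis window, and the synthetic ten-harmonic test tone at 8 kHz. A practical implementation would need windowing and more careful peak selection.

```python
import math


def log_magnitude_spectrum(frame, floor=1e-10):
    """log|X[k]| for k = 0..N-1, computed with a direct (slow) DFT."""
    n = len(frame)
    out = []
    for k in range(n):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        out.append(math.log(math.hypot(re, im) + floor))
    return out


def cepstrum_pitch(frame, sample_rate, fmin=120.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) from the real-cepstrum peak."""
    n = len(frame)
    log_mag = log_magnitude_spectrum(frame)
    q_lo = int(sample_rate / fmax)  # shortest period searched, in samples
    q_hi = int(sample_rate / fmin)  # longest period searched, in samples
    best_q, best_val = q_lo, float("-inf")
    for q in range(q_lo, q_hi + 1):
        # Inverse DFT of the real, symmetric log spectrum at quefrency q.
        val = sum(v * math.cos(2 * math.pi * k * q / n)
                  for k, v in enumerate(log_mag)) / n
        if val > best_val:
            best_q, best_val = q, val
    return sample_rate / best_q


# Synthetic 40 ms frame at an assumed 8 kHz: ten harmonics of 200 Hz.
fs, f0 = 8000, 200.0
frame = [sum(math.sin(2 * math.pi * f0 * h * i / fs) / h for h in range(1, 11))
         for i in range(320)]
print(cepstrum_pitch(frame, fs))
```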

A feature vector can then be formed whose components are the individual features obtained from the spectral analysis above. For example, (c2, ···, c13, p), formed by appending the pitch value p of the song sound to the vector of MFCC coefficients 2 to 13, is one such feature vector.

Referring to FIG. 4, whereas FIG. 2 denotes the song sound corresponding to lyrics information /a/ by [a], FIG. 4 denotes it by the feature vector <a> = <a_1, ···, a_n>, where a_1, ···, a_n are the feature values of lyrics information /a/. Like the symbol [a], the symbol <a> could be written differently for each note of the scale, but the same symbol <a> is used to avoid unnecessary notational complexity.

The feature vectors of FIG. 4 are feature vectors of known song sound information input during a unit time, and they can be stored in the storage unit 200 in advance. With the feature vectors of the source song sound information stored as a source song sound information table like that of FIG. 4, the feature vector obtained by analyzing the input source song sound information is compared with the stored values to determine and classify the lyrics information and pitch information of the song sound (1002c). For this comparison, the k-NN (k-Nearest Neighbors) algorithm can be used, which determines the lyrics information and pitch information of the song sound from the k stored neighbors whose features are closest. The parameters needed to compute this algorithm may be stored in the storage unit 200.
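The k-NN comparison against the stored table can be sketched as below. The table layout (a list of feature vector and label pairs), the Euclidean distance, the majority vote, and the name `knn_classify` are assumptions made for illustration; the source does not specify a distance measure or a voting rule.

```python
import math
from collections import Counter


def knn_classify(query, table, k=3):
    """Return the majority label among the k stored vectors closest to query.

    table is a list of (feature_vector, label) pairs, where a label could be,
    for example, a (lyrics, pitch) tuple.
    """
    nearest = sorted(table, key=lambda row: math.dist(query, row[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

In this sketch a feature vector such as (c2, ···, c13, p) would be passed as `query`, and the returned label identifies the table entry used for matching.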

Hereinafter, a voice processing method of the control unit of a voice instrument according to a modification of the first embodiment of the present invention will be described with reference to the drawings.

FIG. 5 is a flowchart showing the function of the control unit to which an artificial intelligence algorithm is applied in a voice instrument according to a modification of the first embodiment of the present invention. FIG. 6A (a) is a flowchart showing an artificial intelligence learning method using a convolutional neural network according to the modification of the first embodiment, FIG. 6A (b) is a flowchart showing the transformation process through the multiple layers of the convolutional neural network, and FIG. 6B is a flowchart showing an artificial intelligence learning method using a generative adversarial network according to the modification of the first embodiment.

Referring to FIG. 5, the voice processing method 1000A using an artificial intelligence algorithm according to the modification of the first embodiment of the present invention may include: a step (1001) of receiving source song sound information from the input unit 100; a step (1002A) of analyzing the input source song sound information to classify the lyrics information, pitch information, and volume information of the song sound over time; a step (1003) of matching and obtaining, from the target song sound information stored in the storage unit 200, target song sound information having lyrics information and pitch information corresponding to those of the source song sound information; a step (1004) of amplifying the obtained target song sound information according to the volume information of the source song sound information to obtain amplified target song sound information; and a step (1005) of transmitting the amplified target song sound information to the output unit 300 for output. Steps 1001, 1003, 1004, and 1005 are common to the voice processing method 1000 using feature vectors according to the first embodiment and the voice processing method 1000A using an artificial intelligence algorithm according to the modification of the first embodiment, and they may perform the same functions in both.

Here, an artificial intelligence algorithm may be applied in the step (1002A) of analyzing the input source song sound information to classify the lyrics information, pitch information, and volume information of the song sound over time. The classification method applying an artificial intelligence algorithm may include a preprocessing step (1002a) of converting the analog source song sound information input through the skin vibration sensor 110 into digital form for analysis, a step (1002d) of applying the artificial intelligence algorithm to the preprocessed source song sound information, and a step (1002e) of classifying using artificial intelligence.

For artificial intelligence learning, let the source song sound information be denoted x (1101). A song sound information distribution consisting of (x, y) pairs, each of which includes the target song sound information y (1103) corresponding to x, can then be obtained. The target song sound information y (1103) may include lyrics information, pitch information, volume information, and the like. However, when the second input device 130 is present, as in the second embodiment described later, the target song sound information y (1103) may consist of only lyrics information and pitch information. When both the first and second input devices 120 and 130 are present, the target song sound information y (1103) may consist of lyrics information only. Furthermore, the number of items of information to be classified in the target song sound information y (1103), that is, the number of classes, may be denoted K. For example, if only the lyrics information is classified, the number N of target song sound information items in FIG. 2 becomes the number K of classes in the target song sound information y (1103).

In this situation, the artificial intelligence learning can be carried out by constructing any of a variety of artificial neural networks.

Referring to FIG. 6A, the learning method 1100 using a convolutional neural network may consist of a combination of multiple layers, each layer being one mathematical transformation such as: 2D convolution (Conv2D) (1102a, 1102c, 1102g, 1102i), which mixes the source song sound information x (1101) using filters; the rectified linear unit (ReLU) (1102b, 1102d, 1102h, 1102j, 1102n), an activation function that applies a nonlinear transformation; max pooling (1102e, 1102k), which reduces the size of the information by selecting the maximum value in a given region; dropout (1102f, 1102l, 1102o), which selects only a certain fraction of the connections from the neurons of one layer to the neurons of the next layer so that the learning converges more easily; fully connected (FC) (1102m, 1102p), which connects all the neurons of one layer to all the neurons of the next layer; and softmax (1102q), which gives the probability that the input data belongs to each of the K classes to be distinguished in the target song sound information y (1103). For example, in the learning method 1100 using a convolutional neural network, the convolutional neural network (1102) may sequentially include, as shown in FIG. 6A (b), the layers Conv2D (1102a), ReLU (1102b), Conv2D (1102c), ReLU (1102d), Max Pooling (1102e), Dropout (1102f), Conv2D (1102g), ReLU (1102h), Conv2D (1102i), ReLU (1102j), Max Pooling (1102k), Dropout (1102l), FC (1102m), ReLU (1102n), Dropout (1102o), FC (1102p), and Softmax (1102q). The transformations of the dimensions of the data entering and leaving each layer, and various other options, are not the core of the present invention and are omitted to avoid unnecessary complexity. It will also be apparent to those skilled in the art that this layer configuration may be modified in various ways.
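Several of the layer operations named above can be sketched on small nested lists as shown below. This only illustrates what each operation computes. Kernel sizes, channel counts, strides, and the dropout rate are not given in the source, so the single-channel "valid" convolution and the non-overlapping pooling window used here are assumptions.

```python
import math


def conv2d(image, kernel):
    """Single-channel 'valid' 2D convolution (cross-correlation form, as in most CNN libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(len(image[0]) - kw + 1)]
            for r in range(len(image) - kh + 1)]


def relu(matrix):
    """Replace negative values with zero."""
    return [[max(0.0, v) for v in row] for row in matrix]


def max_pool(matrix, size=2):
    """Non-overlapping max pooling; assumes both dimensions are multiples of size."""
    return [[max(matrix[r + dr][c + dc] for dr in range(size) for dc in range(size))
             for c in range(0, len(matrix[0]), size)]
            for r in range(0, len(matrix), size)]


def softmax(scores):
    """Turn K scores into K probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

A forward pass in the spirit of FIG. 6A (b) would chain such operations, for example `max_pool(relu(conv2d(x, w)))`, followed by fully connected layers and `softmax`. A practical implementation would use a deep learning framework rather than nested lists.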

The plurality of parameters required when the input source song sound information x (1101) is transformed through the multiple layers of FIG. 6A (b) can be expressed as a single artificial intelligence (AI) parameter vector.

In the learning method 1100 using a convolutional neural network, the network can be trained so as to minimize a cost function that includes a cross-entropy loss such as [Equation 1]. This loss is defined on the probability that the value obtained by transforming source song sound information x (1101), located in the paired song sound information distribution, through the convolutional neural network (1102) described above is one of y = 1, 2, …, K of the target song sound information y (1103). In this way, the AI parameter vector needed to represent the convolutional neural network (1102) is learned and optimized. Source song sound information x (1101) whose corresponding target song sound information y (1103) is known is source song sound information drawn from this paired distribution.

In this case, the storage unit 200 may store the optimized AI parameter vector instead of the feature vectors constituting the source song sound information table of FIG. 4. Using the learned and stored AI parameter vector, the source song sound information x (1101) input from the input unit 100 during a unit time can be transformed and classified.

[Equation 1]
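The image of this equation is not reproduced in the text. Purely as an illustration of a cross-entropy loss consistent with the description, and with every symbol an assumption (θ for the AI parameter vector, p_data(x, y) for the paired song sound information distribution, and p_θ(y | x) for the softmax output of the network), one standard form would be:

```latex
L_{1}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y)\,\sim\,p_{\mathrm{data}}(x,\,y)}
  \bigl[\, \log p_{\theta}(y \mid x) \,\bigr],
  \qquad y \in \{1, 2, \dots, K\}
```

Minimizing this quantity over θ raises the probability that the network assigns to the known class y of each paired sample.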

Referring to FIG. 6B, unlike the learning method 1100 using a convolutional neural network, the learning method 1200 using a generative adversarial network may consist of two neural networks: a discriminator network (1202) and a generator network (1205). The discriminator network (1202) may sequentially include layers such as Conv2D, Add b1, ReLU, Ave Pooling, Conv2D, Add b2, ReLU, Ave Pooling, MatMul, Add b3, ReLU, MatMul, and Add b4 (not shown). Here, Add b1, b2, b3, and b4 are operations that add parameter values b1, …, b4. Ave Pooling, unlike Max Pooling (1102e, 1102k) of FIG. 6A (b), reduces the size of the information by taking the average value rather than the maximum value over a given region. MatMul is a matrix multiplication. Considerations of the dimensions of the data entering and leaving each layer, and various other options, are not the core of the present invention and are omitted to avoid unnecessary complexity. It will also be apparent to those skilled in the art that this layer configuration may be modified in various ways.

The generator network (1205) may sequentially include layers such as MatMul, Add c1, BN, ReLU, Conv2D, Add c2, BN, ReLU, Conv2D, Add c3, BN, ReLU, Conv2D, Add c4, and Sigmoid (not shown). Here, BN (Batch Normalization), like Dropout (1102f, 1102l, 1102o) of FIG. 6A (b), is one of the transformations that make the artificial intelligence learning converge more easily, and Sigmoid, like ReLU (1102b, 1102d, 1102h, 1102j, 1102n) of FIG. 6A (b), is an activation function that applies a nonlinear transformation. Considerations of the dimensions of the data entering and leaving each layer, and various other options, are not the core of the present invention and are omitted to avoid unnecessary complexity. It will also be apparent to those skilled in the art that this layer configuration may be modified in various ways.

In the learning method 1200 using a generative adversarial network, the song sound information distribution may be of two kinds. Specifically, with the source song sound information denoted x (1201), there may be a song sound information distribution consisting of (x, y) pairs that include the corresponding target song sound information y (1203), and a song sound information distribution consisting only of source song sound information x (1201) whose corresponding target song sound information y (1203) is unknown. Furthermore, the number of classes to be distinguished in the target song sound information y (1203) may be K+1: the K classes y = 1, 2, …, K corresponding to true (1203a), as in the learning method 1100 using a convolutional neural network, plus one more class y = K+1 corresponding to false (1203b).

In this situation, the discriminator network (1202) can be trained for a certain number of iterations, with the generator network (1205) held fixed, so as to minimize a combination of cost functions that includes the following three terms. The first is a cross-entropy loss such as [Equation 1] for the probability that the value obtained by transforming source song sound information x (1201), located in the paired song sound information distribution, through the discriminator network (1202) is one of y = 1, 2, …, K corresponding to true (1203a) of the target song sound information y (1203). The second is a cross-entropy loss such as [Equation 2] for the probability that the value obtained by transforming source song sound information x (1201), located in the distribution consisting only of source song sound information, through the discriminator network (1202) belongs to y = 1, 2, …, K corresponding to true (1203a), that is, does not belong to y = K+1 corresponding to false (1203b). The third is a cross-entropy loss such as [Equation 3] for the probability that the generated sound information (1206), obtained by transforming noise data z (1204) through the generator network (1205), is judged by the discriminator network (1202) to be false (1203b), that is, to belong to y = K+1.

[Equation 2]

[Equation 3]
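The images of these two equations are not reproduced in the text. As a hedged illustration only, with all symbols assumed (θ for the discriminator parameter vector, φ for the generator parameter vector, G_φ for the generator network, p_data(x) for the distribution of source song sound information without known targets, p_z for the noise distribution, and p_θ(y | ·) for the discriminator's softmax output over K+1 classes), losses matching the verbal description could take the form:

```latex
\begin{aligned}
L_{2}(\theta) &= -\,\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}
  \bigl[\, \log\bigl(1 - p_{\theta}(y = K+1 \mid x)\bigr) \,\bigr] \\
L_{3}(\theta) &= -\,\mathbb{E}_{z \sim p_{z}(z)}
  \bigl[\, \log p_{\theta}\bigl(y = K+1 \mid G_{\phi}(z)\bigr) \,\bigr]
\end{aligned}
```

The first expression rewards the discriminator for keeping real input out of the false class y = K+1, and the second rewards it for placing generated sound in that class.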

Furthermore, the generator network (1205) can be trained for a certain number of iterations, with the discriminator network (1202) held fixed, so as to minimize a cost function that includes a cross-entropy loss such as [Equation 4]. This loss is defined on the probability that the generated sound information (1206), obtained by transforming noise data z (1204) through the generator network (1205), is judged by the discriminator network (1202) to be real, that is, to belong to y = 1, 2, …, K corresponding to true (1203a) of the target song sound information y (1203).

[Equation 4]
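The image of this equation is not reproduced in the text. As an illustration only, with all symbols assumed (φ for the generator parameter vector, G_φ for the generator network, p_z for the noise distribution, and p_θ(y | ·) for the discriminator's softmax output over K+1 classes), a generator loss matching the verbal description could be written as:

```latex
L_{4}(\phi) \;=\; -\,\mathbb{E}_{z \sim p_{z}(z)}
  \bigl[\, \log\bigl(1 - p_{\theta}\bigl(y = K+1 \mid G_{\phi}(z)\bigr)\bigr) \,\bigr]
```

Minimizing this over φ, with θ held fixed, pushes generated sound toward the K true classes.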

When the discriminator network (1202) and the generator network (1205) are trained alternately as described above, the discriminator network AI parameter vector and the generator network AI parameter vector, each consisting of the plurality of parameters needed to represent the respective network, are learned and optimized. In this case, the storage unit 200 may store the optimized discriminator network AI parameter vector instead of the feature vectors constituting the source song sound information table of FIG. 4. Using the learned and stored discriminator network AI parameter vector, the source song sound information x (1201) input from the input unit 100 during a unit time can be transformed and classified. The storage unit 200 may of course also store the generator network AI parameter vector, which can be used to generate target song sound information. That is, training with the learning method 1200 using a generative adversarial network ends when the generated sound information (1206) produced from the noise data z (1204) has a probability of 0.5 of being true (1203a) and a probability of 0.5 of being false (1203b), at which point the sound information (1206) generated by the generator network (1205) is almost indistinguishable from true (1203a). Accordingly, the sound information (1206) generated in the learning method 1200 using a generative adversarial network, which classifies even the lyrics information and pitch information of the target song sound information y (1203), can be used as target song sound information.
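The per-sample loss terms used in the alternating training can be sketched from a softmax output over K+1 classes, where the last entry is taken to be the false class. The function names and this ordering are illustrative assumptions; the source gives the losses only by reference to [Equation 1] through [Equation 4].

```python
import math


def labeled_loss(probs, label):
    """Cross-entropy for a sample with a known class; label is a 0-based index below K."""
    return -math.log(probs[label])


def real_loss(probs):
    """Real sample without a known class: penalise probability on the last ('false') class."""
    return -math.log(1.0 - probs[-1])


def fake_loss(probs):
    """Generated sample, discriminator step: penalise probability outside the 'false' class."""
    return -math.log(probs[-1])


def generator_loss(probs):
    """Generated sample, generator step: penalise probability on the 'false' class."""
    return -math.log(1.0 - probs[-1])
```

At the stopping point described above, where a generated sample is judged true or false with probability 0.5 each, `fake_loss` and `generator_loss` both equal log 2, so neither network can improve at the other's expense.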

The learning method 1200 using a generative adversarial network, which trains the discriminator network (1202) and the generator network (1205) alternately as described above, may classify the source song sound information x (1201) better than the learning method 1100 using a convolutional neural network, even with fewer samples of the song sound information distribution.

Here, each source song sound information x (1101, 1201) may be the waveform of the song sound itself, but it may also be a feature vector such as <a> of FIG. 4 obtained by transforming the waveform, or a spectrogram, which is a two-dimensional image that represents the waveform of the song sound as pixel brightness on the time-frequency plane.
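A spectrogram of the kind mentioned above can be sketched as a list of rows, one per time frame, each holding the power in every frequency bin. The frame length, hop size, Hann window, and naive DFT below are illustrative assumptions, not values from the source.

```python
import cmath
import math


def spectrogram(samples, frame_len=64, hop=32):
    """Return one row of power values per frame and one column per frequency bin."""
    # Periodic Hann window to reduce leakage between neighbouring bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / frame_len) for t in range(frame_len)]
    rows = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + t] * window[t] for t in range(frame_len)]
        row = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / frame_len)
                      for t in range(frame_len))
            row.append(abs(acc) ** 2)
        rows.append(row)
    return rows
```

Each value in the returned grid corresponds to one pixel of the image: the row index is time, the column index is frequency, and the value (often shown on a logarithmic scale) is the brightness.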

Just as the volume information of the source song sound information x (1101, 1201) can be analyzed and classified separately, the K classes to be distinguished in the target song sound information y (1103, 1203) may include the pitch information of the source song sound information x (1101, 1201), or they may include only its lyrics information, with the pitch information analyzed and classified separately by the cepstrum method described above. In the latter case, if the number of lyrics information items is N as in FIG. 2, then K = N and only the lyrics information of the song sound is learned by the artificial intelligence.

After the input source song sound information is analyzed and its lyrics information, pitch information, and volume information over time are classified, song sound information having the same lyrics information and the same pitch information as the source song sound information can be matched and obtained from the target song sound information stored in the storage unit 200 (1003). When the feature vectors constituting the source song sound information table have been stored in the storage unit 200 by the voice processing method of FIG. 3, the match is made to the element of the target song sound information table that corresponds to the matching position in the source song sound information table. When an AI parameter vector has been stored by the voice processing method of FIG. 5, the input is classified according to the computation result of the convolutional neural network (1102) or the discriminator network (1202) and matched to the corresponding element of the target song sound information table. In addition to the AI parameter vector, other data related to the artificial intelligence algorithm, such as AI constant values, may of course also be stored in the storage unit 200. Here, the AI parameter vector may include the parameter vector obtained by the learning method 1100 using a convolutional neural network, and the discriminator network AI parameter vector and the generator network AI parameter vector obtained by the learning method 1200 using a generative adversarial network.

So that the waveform of the output song sound is connected continuously, the unit time of the target song sound information stored in the storage unit 200 may be longer than the unit time of the input source song sound information. In this case, the target song sound information matched to the source song sound information input during a unit time can be obtained by clipping it from the corresponding target song sound information in the target song sound information table so that it connects continuously with the end of the waveform of the immediately preceding matched target song sound information. The input source song sound information may also be acquired in overlapping segments for analysis and classification.
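One possible way to choose the clipping position is sketched below: the stored target waveform is searched for the place where the samples just before the clip best match the last samples already output. The source does not specify how continuity is achieved, so the least-squares criterion, the function name `clip_continuing`, and the use of a short tail of previous samples are all assumptions.

```python
def clip_continuing(stored, previous_tail, unit_len):
    """Cut unit_len samples from stored so that they follow on from previous_tail.

    stored        -- stored target waveform, longer than unit_len
    previous_tail -- the last few samples of the clip that was output before
    """
    tail_len = len(previous_tail)
    best_start, best_err = tail_len, float("inf")
    for start in range(tail_len, len(stored) - unit_len + 1):
        # Squared error between the samples preceding this start and the previous tail.
        err = sum((stored[start - tail_len + i] - previous_tail[i]) ** 2
                  for i in range(tail_len))
        if err < best_err:
            best_start, best_err = start, err
    return stored[best_start:best_start + unit_len]
```

Using a tail of at least two samples lets the match take into account whether the waveform is rising or falling, not only its level, which helps avoid an audible click at the join.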

The obtained target song sound information is amplified according to the classified volume information (1004), and the amplified target song sound information is transmitted to the output unit 300 for output (1005). Amplifying an electrical signal and transmitting it to the output unit 300 are well-known conventional techniques, so a detailed description is omitted.
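In a digital implementation, amplification according to the volume information could be as simple as scaling the target samples to a requested level. The sketch below assumes that the volume information is expressed as an RMS level; the source does not define a volume measure, so this is only one possible choice, and the function names are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def amplify_to_volume(target, source_volume):
    """Scale target so that its RMS level equals source_volume."""
    level = rms(target)
    if level == 0.0:
        return list(target)  # a silent clip cannot be scaled to a level
    gain = source_volume / level
    return [s * gain for s in target]
```

With this choice, a louder source volume yields a proportionally larger gain, while the tone of the target clip is left unchanged.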

Hereinafter, the configuration of a voice instrument according to a second embodiment of the present invention will be described with reference to the drawings.

FIG. 7 is a configuration diagram of a voice instrument according to the second embodiment of the present invention.

The voice instrument 10 according to the first embodiment of the present invention classifies the source song sound information input through the skin vibration sensor 110 of the input unit 100, matches it against the target song sound information stored in advance in the storage unit 200 to obtain the corresponding target song sound information, and amplifies and outputs the obtained target song sound information. In this case, the lyrics information, pitch information, and volume information can all be obtained from the source song sound information input through the skin vibration sensor 110.

Alternatively, only the lyrics information may be obtained from the source song sound information input through the skin vibration sensor 110, and a first input device 120 for inputting pitch information and a second input device 130 for inputting volume information may be added so that the pitch information and volume information of the source song sound information are input directly.

Referring to FIG. 7, the input unit 100 may additionally include a first input device 120 for inputting the pitch information of the source song sound information. A keyboard such as that of a piano is one example of the first input device 120. The first input device may of course be implemented in various other ways, such as with touch buttons instead of piano keys.

The input unit 100 may further include a second input device 130 for inputting the volume information of the source song sound information. A slide volume controller, which adjusts the volume by moving an electrode 131 along a linear resistor 132 to change the voltage, is one example of the second input device 130. The second input device may of course be implemented in various other ways, such as with a foot-pedal volume controller or a rotary volume controller instead of a slide volume controller.

In addition, input/output devices 500 and 600 may further be included for inputting, outputting, selecting, storing, changing, and/or deleting song sound related data, including the source song sound information and/or the target song sound information, in the storage unit 200. That is, the voice instrument may further include an output device 500 that displays such song sound related data for these operations, and an input device 600 for performing them.

Furthermore, two or more groups of target song sound information may be stored using the input/output devices 500 and 600; that is, m (m > 1) groups may be stored. For example, m groups of target song sound information tables for target singers T1, ..., Tm (m > 1) may be stored, the voices of the desired singers may be selected using the input/output devices 500 and 600, and the target song sound information of the selected groups may then be matched, amplified, and output. Duets, trios, and the like are therefore also possible. In this case, it is evident that information on harmony can also be input and output using the input/output devices 500 and 600.
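The selection, matching, and amplification of several target singers can be sketched as a table lookup followed by a mix. The code below is only an illustrative sketch: the table layout (a dictionary keyed by lyric and pitch number), the representation of sound as a list of sample values, the per-voice pitch used to express harmony, and the name `render_ensemble` are all assumptions, not details given in the specification.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, int]            # (lyric, pitch number); hypothetical key layout
Table = Dict[Key, List[float]]   # stored target samples for one singer


def render_ensemble(tables: Dict[str, Table],
                    parts: List[Tuple[str, int]],
                    lyric: str,
                    volume: float) -> List[float]:
    """Match, amplify, and mix the target sounds of the selected singers.

    `parts` lists one (singer, pitch) pair per voice, so a duet has two
    entries and a trio has three.
    """
    # Matching: look up each selected singer's samples by lyric and pitch.
    matched = [tables[singer][(lyric, pitch)] for singer, pitch in parts]
    length = max((len(samples) for samples in matched), default=0)
    mixed = [0.0] * length
    for samples in matched:
        for i, value in enumerate(samples):
            # Amplify according to the source volume, then sum the voices.
            mixed[i] += volume * value
    return mixed
```

A duet would pass two entries in `parts`, for instance `[("T1", 60), ("T2", 64)]`, where the differing pitch numbers stand in for the harmony information.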

Furthermore, one or more source song sound information tables may be stored for a plurality of people S1, ..., Sk (k > 1). That is, when S1 performs, the feature vectors of S1 may be stored in the storage unit 200 in advance, and the feature vectors of S2 may likewise be stored so that S2 can also perform. Storing the performer's own feature vectors can further reduce recognition errors. When S2 wants to sing in the voice of T3, the source song sound information S2 or the target song sound information T3 can be selected using an output device 500 such as a display and an input device 600 such as a keyboard.
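One simple way to see why storing each performer's own feature vectors can reduce recognition errors is a nearest-neighbor classifier that compares an incoming feature vector only against the vectors stored for the current performer. This is a minimal sketch under that assumption; the Euclidean distance, the dictionary layout, and the name `classify_lyric` are illustrative and are not taken from the specification.

```python
import math
from typing import Dict, Optional, Sequence


def classify_lyric(feature: Sequence[float],
                   performer_vectors: Dict[str, Sequence[float]]) -> Optional[str]:
    """Return the lyric label whose stored feature vector is closest to `feature`.

    `performer_vectors` maps each lyric label to the feature vector stored
    in advance for one performer (for example, S1 or S2).
    """
    best_label, best_distance = None, math.inf
    for label, reference in performer_vectors.items():
        distance = math.dist(feature, reference)  # Euclidean distance
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label
```

Because the same measured vector can lie closest to different lyrics for different performers, selecting the table that belongs to the person actually performing matters for the result.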

In addition, a communication device 700 may be included for downloading and/or uploading song sound related data including the source song sound information and/or the target song sound information. That is, a communication device 700 may be included for downloading target voices and the like from, and/or uploading them to, the Internet or other electronic devices such as smartphones.

Hereinafter, a computer program for implementing a voice processing method using the voice instrument of either the first embodiment or the second embodiment of the present invention will be described.

The computer program may be a computer program stored in a recording medium in order to implement the voice processing method using the voice instrument 10 of either the first embodiment or the second embodiment of the present invention. The computer program stored in the recording medium may be executed by a general-purpose computer or a dedicated computer.

The voice processing method using the voice instrument 10 of either the first embodiment or the second embodiment of the present invention may be stored in a computer-readable recording medium on which a program for executing the voice processing method on a computer is recorded. The recording medium includes every kind of recording medium in which programs and/or data are stored so as to be readable by a computer system. Examples include magnetic tape, magnetic disks such as flexible disks and hard disks, optical discs such as CDs and DVDs, magneto-optical discs, and semiconductor memory such as USB memory and memory cards. The voice processing method can be executed by, for example, installing the program on a general-purpose computer from such a recording medium.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be implemented in other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

Claims

A voice instrument comprising:
an input unit configured to receive, from a user, source song sound information including lyrics information, pitch information, and volume information;
a storage unit configured to store song sound related data including target song sound information that includes lyrics information and pitch information;
a control unit configured to obtain, from the storage unit, the target song sound information having lyrics information and pitch information corresponding to the lyrics information and the pitch information of the source song sound information, and to amplify the obtained target song sound information according to the volume information of the source song sound information to obtain amplified target song sound information; and
an output unit configured to output the amplified target song sound information,
wherein the input unit includes a skin vibration sensor for receiving at least one of the lyrics information, the pitch information, and the volume information of the source song sound information,
wherein the target song sound information includes tone information different from that of the source song sound information, and
wherein the control unit classifies the lyrics information from the source song sound information input through the input unit by using an AI variable vector trained using an artificial intelligence network.

The voice instrument of claim 1,
wherein the input unit further includes a first input device for receiving the pitch information of the source song sound information, and
wherein, when the first input device receives the pitch information of the source song sound information, the skin vibration sensor receives the lyrics information and the volume information of the source song sound information.

The voice instrument of claim 1,
wherein the input unit further includes a second input device for receiving the volume information of the source song sound information, and
wherein, when the second input device receives the volume information of the source song sound information, the skin vibration sensor receives the lyrics information and the pitch information of the source song sound information.

The voice instrument of claim 1,
wherein the input unit further includes a first input device for receiving the pitch information of the source song sound information, and
a second input device for receiving the volume information of the source song sound information, and
wherein, when the first input device receives the pitch information of the source song sound information and the second input device receives the volume information of the source song sound information, the skin vibration sensor receives the lyrics information of the source song sound information.

The voice instrument of claim 1,
wherein the song sound related data stored in the storage unit includes a target song sound information table having different pitch information for each of a plurality of pieces of lyrics information that can be uttered within a unit time.

The voice instrument of claim 1,
wherein a unit time of the target song sound information stored in the storage unit is longer than a unit time of the source song sound information.

The voice instrument of claim 1,
wherein the storage unit stores a plurality of feature vectors or AI variable vectors for classifying a plurality of pieces of lyrics information that can be uttered within a unit time.

The voice instrument of claim 1,
wherein the storage unit includes at least two groups of the target song sound information.

The voice instrument of claim 1,
further comprising an input device and an output device for inputting, outputting, selecting, storing, changing, and/or deleting the source song sound information and/or the song sound related data.

The voice instrument of claim 1,
further comprising a communication device for downloading and/or uploading the source song sound information and/or the song sound related data.

The voice instrument of claim 1,
wherein the artificial intelligence network is a convolutional neural network (CNN) or a generative adversarial network (GAN).

The voice instrument of claim 1,
wherein the target song sound information is generated using a generator-network AI variable vector trained with a generative adversarial network.

The voice instrument of claim 1,
wherein the target song sound information includes rap, pansori, Peking opera, accent, and/or intonation data.

A voice processing method comprising:
receiving, through a skin vibration sensor of an input unit, at least one of lyrics information, pitch information, and volume information of source song sound information that includes the lyrics information, the pitch information, and the volume information;
obtaining, from a storage unit, target song sound information having lyrics information and pitch information corresponding to the lyrics information and the pitch information of the source song sound information;
amplifying the obtained target song sound information according to the volume information of the source song sound information to obtain amplified target song sound information; and
outputting the amplified target song sound information as a song sound,
wherein the target song sound information includes tone information different from that of the source song sound information, and
wherein the obtaining of the target song sound information from the storage unit includes classifying the lyrics information from the source song sound information input through the input unit by using an AI variable vector trained using an artificial intelligence network.

A computer program stored in a recording medium for implementing the voice processing method of claim 14.