KR102497878B1

KR102497878B1 - Vocal transcription learning method and apparatus for performing learning based on note-level audio data

Info

Publication number: KR102497878B1
Application number: KR1020220079533A
Authority: KR
Inventors: 금상은; 이종필
Original assignee: 뉴튠(주)
Priority date: 2021-11-26
Filing date: 2022-06-29
Publication date: 2023-02-09
Also published as: KR102417670B1

Abstract

A vocal transcription device that performs learning based on note-level audio data according to an embodiment comprises one or more processors and a memory module storing instructions executable by the one or more processors. The processors include: a first artificial neural network that takes first audio data in the frequency domain as first input information and outputs, as first output information, pitch information including the vocal pitch for each frame of the first audio data; a preprocessing module that converts the first output information into first learning data including vocal information for each note; a third artificial neural network that takes second audio data in the frequency domain as third input information and outputs, as third output information, pitch information including the vocal pitch for each note of the second audio data; and a postprocessing module that converts the third output information into vocal information for each note, wherein the third artificial neural network can be trained based on the first learning data. This provides a deep-learning-based technology that more accurately predicts the notes corresponding to a vocal melody in polyphonic music.

Description

Vocal transcription learning method and apparatus for performing learning based on note-level audio data

The present invention relates to a vocal transcription learning method and apparatus that perform learning based on note-level audio data, and more particularly to a technique for accurately outputting vocal pitch information for input audio using an artificial neural network and a technique for efficiently training the artificial neural network using the same.

Transcription means writing down, as a score, music that was not originally notated. Its purpose is either to arrange the music and add it to a repertoire or to analyze it academically. There are two methods of transcription: prescriptive notation, which records the original form in simplified terms, and descriptive notation, which records the performance in close detail exactly as played.

Meanwhile, automatically converting the music of an input sound source into a score by a computer processor or the like is called automatic transcription. Specifically, an automatic transcription system, which belongs to the field of melody recognition within speech processing, recognizes the pitch (interval), length (duration), and lyrics of a song sung by a human voice or played on an instrument and presents the result in the form of a score. Compared with the conventional approach in which an expert familiar with music listens to a song and transcribes it by hand, automatic transcription lets the system recognize and analyze the musical characteristics of the singer's song and turn them into a score automatically, so it has the advantage that even non-experts can use it easily.

Automatic transcription methods known so far include a method that merges sections segmented into phoneme units and uses them as syllable boundary information, so that feature information extracted from a continuous voice signal can be recognized as individual notes, and a method that forms syllable segments using energy information obtained by connecting the maximum values of the voice occurring at each pitch interval of the voice signal.

However, the conventional automatic transcription systems described above have two drawbacks. Because voice signals are continuous, efficiency drops markedly wherever a syllable boundary is ambiguous, and an additional step is needed to find a representative pitch value for each predicted syllable segment in order to recognize its pitch. In addition, because bar detection is impossible, the recognition result of a song cannot be presented as a complete score, so satisfactory results cannot be provided, and transcription is possible only when the input data is an unaccompanied monophonic vocal.

Moreover, because people usually sing in their own style, singing tempo and timing vary greatly between individuals. Conventional methods of note-duration recognition, however, map the input onto standard notes based on generalized standard data, so they cannot adapt to the tempo of each person's sung input.

Korean Patent Publication No. 10-2015-0084133 (published July 22, 2015) - 'Pitch recognition using sound interference and scale transcription method using the same'
Korean Patent Registration No. 10-1696555 (June 5, 2019) - 'System for searching text location in video or geographic information through voice recognition'

Accordingly, the vocal transcription learning method and apparatus that perform learning based on note-level audio data according to an embodiment were devised to solve the problems described above. Their object is to provide a deep-learning-based technology that more accurately predicts the notes corresponding to a vocal melody in polyphonic music by outputting transcription information containing pitch information and vocal information more effectively than the prior art.

More specifically, an object of the vocal transcription learning method and apparatus that perform learning based on note-level audio data according to an embodiment is to provide a method and apparatus that use a model converting frame-level transcription information into note-level transcription information, so that note-level transcription information can be output effectively even with a smaller amount of frame-level data.

A vocal transcription device that performs learning based on note-level audio data according to an embodiment includes one or more processors and a memory module storing instructions executable by the one or more processors. The processor includes: a first artificial neural network that takes first audio data in the frequency domain as first input information and outputs, as first output information, pitch information including the vocal pitch of the first audio data in units of frames; a preprocessing module that converts the first output information into first learning data including vocal information in units of notes; a third artificial neural network that takes second audio data in the frequency domain as third input information and outputs, as third output information, pitch information including the vocal pitch of the second audio data in units of notes; and a postprocessing module that converts the third output information into vocal information in units of notes. The third artificial neural network may be trained based on the first learning data.

The preprocessing module and the postprocessing module may convert frame-level pitch information into note-level pitch information using a pitch quantization method and a rhythm quantization method.
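
The frame-to-note conversion described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `hz_to_midi` and `frames_to_notes`, the use of 0 Hz to mark unvoiced frames, the rounding to the nearest semitone, and the fixed time grid are all assumptions made for illustration. Pitch quantization is taken here to mean rounding each frame's frequency to the nearest MIDI note number, and rhythm quantization to mean snapping note onsets and offsets to a grid.

```python
import math


def hz_to_midi(f_hz):
    """Pitch quantization (assumed form): nearest MIDI note number for a frequency in Hz."""
    return int(round(69 + 12 * math.log2(f_hz / 440.0)))


def frames_to_notes(frame_hz, hop_sec, grid_sec):
    """Merge runs of frames with the same quantized pitch into notes.

    frame_hz: per-frame vocal pitch in Hz, with 0 meaning "no vocal" (assumption).
    hop_sec:  time between consecutive frames, in seconds.
    grid_sec: rhythm-quantization grid, in seconds (assumption).
    Returns a list of (midi_note, onset_sec, offset_sec) tuples.
    """
    notes = []
    current = None  # (midi_note, start_frame) of the note being built
    # A trailing unvoiced sentinel frame flushes the last open note.
    for i, f in enumerate(list(frame_hz) + [0.0]):
        midi = hz_to_midi(f) if f > 0 else None
        if current is not None and midi != current[0]:
            # Rhythm quantization: snap onset and offset to the nearest grid point.
            onset = round(current[1] * hop_sec / grid_sec) * grid_sec
            offset = round(i * hop_sec / grid_sec) * grid_sec
            if offset > onset:  # drop notes that collapse to zero length
                notes.append((current[0], onset, offset))
            current = None
        if midi is not None and current is None:
            current = (midi, i)
    return notes


if __name__ == "__main__":
    frames = [0, 440, 441, 439, 0, 523.25, 523.25]
    for note in frames_to_notes(frames, hop_sec=0.1, grid_sec=0.25):
        print(note)
```

With a coarse grid, short notes can collapse to zero length and are dropped, so the grid size trades rhythmic regularity against the loss of short notes.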

The vocal transcription device that performs learning based on note-level audio data may include a first memory module storing the first learning data and a second memory module containing second learning data that includes pitch information labeled at the note level.

The third artificial neural network may be trained based on the second learning data.

A vocal transcription device that performs learning based on note-level audio data according to an embodiment includes one or more processors and a memory module storing instructions executable by the one or more processors. The processor includes: a first artificial neural network that takes first audio data in the frequency domain as first input information and outputs, as first output information, pitch information including the vocal pitch of the first audio data in units of frames; a preprocessing module that converts the first output information into first learning data including vocal information in units of notes; a third artificial neural network that takes second audio data in the frequency domain as third input information and outputs, as third output information, pitch information including the vocal pitch of the second audio data in units of notes; a fifth artificial neural network that takes third audio data in the frequency domain as fifth input information and outputs, as fifth output information, pitch information including the vocal pitch of the third audio data in units of notes; and a second postprocessing module that converts the third output information into second learning data including vocal information in units of notes. The third artificial neural network may be trained based on at least one of the second learning data and third learning data.

The vocal transcription device that performs learning based on note-level audio data may further include a second postprocessing module that converts the fifth output information into vocal information in units of notes.

A learning method of a note-level automatic vocal transcription device using a processor that includes an artificial neural network according to an embodiment may include: a first output information outputting step in which the processor outputs first output information using a first artificial neural network that takes first audio data in the frequency domain as first input information and outputs, as the first output information, pitch information including the vocal pitch of the first audio data in units of frames; a preprocessing step in which the processor converts the first output information into first learning data including vocal information in units of notes; a third output information outputting step in which the processor outputs third output information using a third artificial neural network that takes second audio data in the frequency domain as third input information and outputs, as the third output information, pitch information including the vocal pitch of the second audio data in units of notes; a postprocessing step in which the processor converts the third output information into vocal information in units of notes; and a learning step in which the processor trains the third artificial neural network based on the first learning data.

The note-level automatic vocal transcription method and apparatus using an artificial neural network according to an embodiment use estimated vocal information when inferring pitch information and use estimated pitch information when inferring vocal information, so they can infer the pitch information and vocal information contained in audio data more effectively than the prior art. Consequently, transcription requires no preprocessing step for separating the vocal, which greatly reduces the amount of computation, and accurate vocal transcription can be performed without vocal separation.

In addition, the note-level automatic vocal transcription method and apparatus using an artificial neural network according to an embodiment convert frame-level pitch information into pseudo-note-level pitch information and then use the converted data as training data for the artificial neural network that performs note-level transcription. This has the advantage that an artificial neural network that outputs note-level pitch information can be trained efficiently with less note-level training data.

Furthermore, because the artificial neural network is trained by semi-supervised learning, the network that outputs note-level pitch information can be trained efficiently even with only a small amount of unlabeled data.
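
One way to picture this training setup is the hedged Python sketch below, which pools human-labeled note-level examples with pseudo-note-level examples derived from frame-level estimates. The patent text does not specify how the two sources are combined; the function name `mix_training_examples`, the per-example loss weight, and the default value of `pseudo_weight` are assumptions introduced only for illustration.

```python
import random


def mix_training_examples(labeled, pseudo_labeled, pseudo_weight=0.5, seed=0):
    """Pool labeled and pseudo-labeled examples into one shuffled training list.

    labeled, pseudo_labeled: lists of (audio_id, notes) pairs.
    pseudo_weight: hypothetical loss weight for pseudo-labeled examples,
                   reflecting that their labels are estimated, not annotated.
    Returns a list of (audio_id, notes, loss_weight) triples.
    """
    mixed = [(audio, notes, 1.0) for audio, notes in labeled]
    mixed += [(audio, notes, pseudo_weight) for audio, notes in pseudo_labeled]
    # A seeded generator keeps the shuffle reproducible between runs.
    random.Random(seed).shuffle(mixed)
    return mixed


if __name__ == "__main__":
    labeled = [("song_a", [(69, 0.0, 0.5)])]
    pseudo = [("song_b", [(72, 0.5, 0.75)]), ("song_c", [(67, 0.0, 1.0)])]
    for example in mix_training_examples(labeled, pseudo):
        print(example)
```

Down-weighting the pseudo-labeled examples is only one possible design choice; using equal weights or training on the two pools in separate stages would be equally consistent with the description.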

Accordingly, the note-level automatic vocal transcription method and apparatus using an artificial neural network according to an embodiment can transcribe sound sources of various genres flexibly and accurately, and can transcribe the vocal melody directly from polyphonic music without a source-separation preprocessing step, which has the advantage of greatly improving transcription speed compared with the prior art.

FIG. 1 is a block diagram of a music transcription system including a music transcription device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing some components of a music transcription device according to an embodiment.
FIG. 3 is a diagram showing input information and output information of a first artificial neural network according to an embodiment.
FIG. 4 is a block diagram showing some components constituting the first artificial neural network according to an embodiment.
FIG. 5 is a block diagram showing some components of a ResNet block according to an embodiment.
FIG. 6 is a diagram showing input information and output information of a second artificial neural network according to an embodiment of the present invention.
FIG. 7 is a diagram showing the configuration of a first artificial neural network module including the first artificial neural network and the second artificial neural network according to an embodiment of the present invention.
FIG. 8 is a diagram showing a first artificial neural network module including the first artificial neural network and the second artificial neural network according to another embodiment of the present invention.
FIG. 9 is a diagram showing a method of calculating an integrated loss function of a music transcription artificial neural network module according to the present invention.
FIG. 10 is a diagram showing components of an automatic music transcription device using an artificial neural network according to an embodiment.
FIG. 11 is a diagram for explaining a preprocessing module of an automatic music transcription device according to an embodiment.
FIG. 12 is a diagram showing components of an automatic music transcription device using an artificial neural network according to another embodiment.

Hereinafter, embodiments according to the present invention are described with reference to the accompanying drawings. In assigning reference numerals to the components of each drawing, note that the same components are given the same numerals wherever possible, even when they appear in different drawings. In describing the embodiments, detailed descriptions of related known configurations or functions are omitted where they would hinder understanding of the embodiments. Although embodiments of the present invention are described below, the technical idea of the present invention is not limited to them and may be modified and implemented in various ways by those skilled in the art.

The terms used in this specification are used to describe the embodiments and are not intended to limit the disclosed invention. Singular expressions include plural expressions unless the context clearly indicates otherwise.

In this specification, terms such as "include", "comprise", and "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Throughout the specification, when a part is said to be "connected" to another part, this covers not only the case where it is "directly connected" but also the case where it is "indirectly connected" with another element in between. Terms including ordinal numbers, such as "first" and "second", may be used to describe various components, but the components are not limited by these terms.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily carry them out. Parts of the drawings that are irrelevant to the description are omitted in order to explain the present invention clearly.

Although the title of the present invention is given as 'Note-level automatic vocal transcription method and apparatus using an artificial neural network', for convenience of description it is abbreviated below as 'music transcription device'.

FIG. 1 is a block diagram of a music transcription system including a music transcription device according to an embodiment of the present invention.

Referring to FIG. 1, the music transcription system 1 may include a user terminal 100 that provides an automatic transcription service and a music transcription device 200 that performs the automatic transcription task. The user terminal 100 may be implemented as a mobile terminal, and the music transcription device 200 may be implemented as a remote server.

Accordingly, the user terminal 100 may include any kind of handheld wireless communication device, such as a PCS (Personal Communication System), GSM (Global System for Mobile communication), PDA (Personal Digital Assistant), IMT (International Mobile Telecommunication)-2000, CDMA (Code Division Multiple Access)-2000, W-CDMA (W-Code Division Multiple Access), or Wibro (Wireless Broadband Internet) terminal, a smartphone, a smart pad, a tablet PC, a smart watch, smart glasses, or another wearable device.

The server on which the music transcription device 200 is implemented is a server in the ordinary sense: computer hardware on which programs run. It can monitor or control the entire network (for example, printer control or file management), connect to other networks through a mainframe or a public network, and support the sharing of software resources such as data, programs, and files and of hardware resources such as modems, fax machines, printers, and other equipment.

The user terminal 100 may include a recording unit 110, a communication unit 120, a MIDI generation unit 130, and a display unit 140.

The recording unit 110 may record a sound source. Specifically, the recording unit 110 may record a sound source heard from outside the user terminal 100 or a sound source being played on the user terminal 100 itself. The sound source recorded by the recording unit 110 may be transmitted to the music transcription device 200 through the communication unit 120.

The communication unit 120 may transmit data including the recorded sound source to the music transcription device 200 and may receive data from the music transcription device 200. For example, the communication unit 120 may receive the transcribed score from the music transcription device 200 and may also receive additional service information related to the transcribed score.

The MIDI generation unit 130 may generate and play a MIDI (Musical Instrument Digital Interface) file corresponding to the transcribed score.
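
As a rough illustration of what generating MIDI data from a transcribed score can involve, the Python sketch below turns note tuples into a time-ordered list of note-on and note-off events. The patent does not describe the internals of the MIDI generation unit 130, so the function name `notes_to_midi_events`, the (pitch, onset, offset) note format, and the default tempo, resolution, and velocity are assumptions. The status bytes 0x90 (note-on) and 0x80 (note-off) on channel 1 follow the MIDI specification; writing an actual MIDI file is left out.

```python
def notes_to_midi_events(notes, bpm=120, ticks_per_beat=480, velocity=80):
    """Convert (midi_note, onset_sec, offset_sec) tuples into MIDI channel events.

    Returns a list of (tick, status, note, velocity) tuples sorted by time,
    where status 0x90 is note-on and 0x80 is note-off on the first channel.
    bpm, ticks_per_beat, and velocity are illustrative defaults (assumptions).
    """
    events = []
    for midi_note, onset_sec, offset_sec in notes:
        # seconds -> beats -> ticks
        on_tick = round(onset_sec * bpm / 60 * ticks_per_beat)
        off_tick = round(offset_sec * bpm / 60 * ticks_per_beat)
        events.append((on_tick, 0x90, midi_note, velocity))
        events.append((off_tick, 0x80, midi_note, 0))
    # At equal ticks, note-off (0x80) sorts before note-on (0x90),
    # so a note ends before the next one starts.
    events.sort(key=lambda e: (e[0], e[1]))
    return events


if __name__ == "__main__":
    for event in notes_to_midi_events([(69, 0.0, 0.5), (72, 0.5, 0.75)]):
        print(event)
```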

The display unit 140 may output the transcribed score or other additional service information so that the user can perceive it. Accordingly, the display unit 140 may include various display panels, such as a liquid crystal display (LCD) panel, a light emitting diode (LED) panel, or an organic light emitting diode (OLED) panel. When the display unit includes a GUI (Graphical User Interface), that is, a software-based device such as a touch pad, it may also serve as an input unit (not shown) that receives the user's input.

The music transcription device 200 may include a processor 300, a memory module 400, an additional service module 500, and a user management module 600.

The processor 300 may automatically generate a transcription, using an artificial neural network module, for a sound source stored in the memory module 400 or received from the user terminal 100. This is described in detail later.

The memory module 400 may store the sound source received from the user terminal 100 and the various data the processor 300 needs to train the artificial neural network module and run inference with it. Reference data received through the user terminal 100 or the like, and various data generated by the processor during training or inference, may also be stored.

The additional service module 500 may generate additional service information related to the score transcribed by the processor 300.

For example, the additional service module 500 may provide a composition assistant service as an additional service. The composition assistant service provided according to the present invention is one of the music application services that use the user terminal 100. It automatically converts a sound source collected from the user terminal 100 into a score using the music transcription device 200 implemented as a remote server, and delivers the score to the user terminal 100 for storage, providing a composition assistant service with improved performance.

Here, a composition assistant service with improved performance means one that uses the music transcription device 200, which provides more accurate transcription than the prior art.

Even a person with no musical training may come up with a melody while humming and may want to link such melodies into a complete composition. Suppose, for example, that a mother hums a lullaby for her baby. After recording this improvised lullaby, she could keep it as an audio file and play it back, but storing it as a score takes less storage space, lets her continue writing the song later, and allows anyone to reproduce the song from the stored score afterwards.

To receive this composition assistant service, the user records a sound source through the user terminal 100 and transmits it to the music transcription device 200 implemented as a remote server. In the user terminal 100, the sound source may be recorded through the recording unit 110 and transmitted through the communication unit 120.

In addition, based on the data generated by the processor 300, the additional service module 500 may also provide a melody-based music search system, a melody-based similar sound source search service, and the like.

For a registered user, the music transcription device 200 may receive a sound source file, analyze and transcribe the sound source, and transmit the transcription result to the user terminal 100. In the music transcription device 200, analysis and transcription of the sound source may be performed by the processor 300, and the transcription result may be transmitted to the user terminal 100 through a communication unit.

The user terminal 100 may display the transcribed score and play the music corresponding to that score through the MIDI generator 130 and the display unit 140. The user can view the transcribed score and, while listening to the corresponding music, complete the composition by adding, deleting, and modifying notes.

The user management module 600 may store data for managing the users of the user terminals 100 that use the music transcription device 200. Although not shown in the drawings, the music transcription device 200 may include a communication unit, and the communication unit may transmit data to the user terminal 100. For example, the communication unit may transmit to the user terminal 100 the score information transcribed by the processor 300, or the various additional service information generated by the additional service module 500.

FIG. 2 is a block diagram showing some components of a music transcription device according to an embodiment, and FIG. 3 is a diagram showing input information and output information of a first artificial neural network according to an embodiment. FIG. 4 is a block diagram showing some components of the first artificial neural network according to an embodiment, and FIG. 5 is a block diagram showing some components of a ResNet block according to an embodiment.

Referring to FIG. 2, the music transcription device 200 according to an embodiment may include a processor 300, a memory module 400, an additional service module 500, and a user management module 600. The processor 300 may include a first artificial neural network module 310 that includes a first artificial neural network 311 and a second artificial neural network 312. Descriptions that overlap with FIG. 1 are omitted, and the music transcription artificial neural network module is described in detail with reference to FIGS. 2 to 7.

Although FIG. 1 shows the processor 300 as including only the first artificial neural network module 310, the processor 300 may also include a second artificial neural network module 320, which includes a third artificial neural network 313 and a fourth artificial neural network 314, and a third artificial neural network module 330, which includes a fifth artificial neural network 315 and a sixth artificial neural network 316. In view of its function, each artificial neural network module may also be called a music transcription artificial neural network module.

The first artificial neural network 311 is an artificial neural network module that takes first input information 10 containing audio data as its input and outputs frame-level pitch information for the vocal contained in the input audio data. The second artificial neural network 312 is an artificial neural network module that takes second input information 30, obtained by summing intermediate information of the first artificial neural network 311, as its input and outputs vocal information, that is, information on whether a vocal is present in the input audio data.

The pitch information output by the first artificial neural network 311 includes, for each frame of the input audio data, information on whether a vocal is present in that frame and, if a vocal is present, the pitch of that vocal.

The first artificial neural network 311 is described in detail with reference to FIGS. 3 and 4. The first artificial neural network 311 according to an embodiment is a pre-trained artificial neural network module that takes the first input information 10 containing audio data as its input and outputs first output information 20 containing pitch information for the vocal contained in the audio data. The first artificial neural network 311 may include an inference session (not shown) that infers the first output information 20 from the first input information 10, and a training session (not shown) that adjusts and updates the parameters of the first artificial neural network 311 so as to increase the accuracy of the first output information 20, based on the first input information 10, the first output information 20, and first reference data.

The first input information 10 is the sound source data for which the user wants to generate transcription information, and it may be data organized in units of frames. FIG. 4 shows, as an example, first input information 10 in which 31 frames of data are input to the first artificial neural network 311 at a time. The embodiments of the present invention are not limited to this example, and the number of audio frames contained in the first input information 10 may be set to various values.
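
The frame-based input described above can be illustrated with a minimal sketch that groups a sequence of spectral frames into windows of 31 frames. The function name `split_into_windows`, the zero-padding of the last window, and the representation of a frame as a list of floats are assumptions made for illustration; the specification only states that 31 frames are input at a time in the example of FIG. 4.

```python
from typing import List, Sequence

FRAMES_PER_INPUT = 31  # value shown in FIG. 4; the text allows other sizes


def split_into_windows(frames: Sequence[Sequence[float]],
                       size: int = FRAMES_PER_INPUT) -> List[List[List[float]]]:
    """Group consecutive frames into fixed-size windows.

    The final window is zero-padded to the full size. Padding is an
    assumption of this sketch, not something the specification states.
    """
    if not frames:
        return []
    n_bins = len(frames[0])
    windows = []
    for start in range(0, len(frames), size):
        window = [list(f) for f in frames[start:start + size]]
        while len(window) < size:
            window.append([0.0] * n_bins)  # silent padding frame
        windows.append(window)
    return windows
```

With 70 input frames, for instance, this yields three windows: two full ones and a third holding the last 8 frames followed by padding.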

The first output information 20 may include frame-by-frame pitch information for the audio data input to the first artificial neural network 311. Accordingly, the first output information 20 may include, for each frame, information on whether a vocal is present and on the level of the vocal. When no vocal is present, the output for that frame may be 0.

Accordingly, as shown at the lower right of FIG. 3, the frame-by-frame information output by the first artificial neural network 311 is NV (Non Vocal) for frames in which no vocal is present at all, and pitch information (for example, D2 or B5) for frames in which a vocal is present.
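
As a hedged illustration of this output format, the sketch below turns a per-frame pitch sequence into the labels shown in FIG. 3. It assumes, purely for the example, that each frame's pitch is encoded as a MIDI note number and that 0 marks a frame without a vocal; the specification does not state the numeric encoding. The names `frame_label` and `label_frames` are hypothetical.

```python
from typing import List

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def frame_label(midi_pitch: int) -> str:
    """Return "NV" for a non-vocal frame, otherwise a note name such as "D2".

    Assumes MIDI note numbers with middle C (60) written as C4.
    """
    if midi_pitch == 0:
        return "NV"
    octave = midi_pitch // 12 - 1
    return f"{NOTE_NAMES[midi_pitch % 12]}{octave}"


def label_frames(pitches: List[int]) -> List[str]:
    """Label every frame of a per-frame pitch sequence."""
    return [frame_label(p) for p in pitches]


print(label_frames([0, 38, 38, 83, 0]))  # ['NV', 'D2', 'D2', 'B5', 'NV']
```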

The first artificial neural network 311 may also be referred to as a pitch information output artificial neural network module. In the example of FIG. 4, 31 frames of data are input to the first artificial neural network 311 as the first input information 10, so the information output from the first artificial neural network 311 may include pitch information for each of those 31 frames. Accordingly, as shown in FIG. 4, the output information may include, for each frame (first frame, second frame, third frame, through the 31st frame), information on whether a sound level is present and, if so, how large it is.

The first artificial neural network 311 may be implemented by combining convolution blocks and ResNet blocks that make up known artificial neural networks. For example, as shown in FIG. 4, the first artificial neural network 311 may include a convolution block 121, a first ResNet block 122, a second ResNet block 123, a third ResNet block 124, a pooling block 125, and a first LSTM block 126.

The components of the first artificial neural network 311 shown in FIG. 4 are only an example, and the embodiments of the present invention are not limited to them. For example, a plurality of convolution blocks may be provided instead of one, and one, two, or four or more ResNet blocks may be provided instead of three.

Looking at the blocks that make up the first artificial neural network 311 according to the present invention, the convolution block 121 at the input of the first artificial neural network 311 is a block that performs a convolution operation on the first input information 10. Typically, the convolution operation used in CNNs may be performed.

A ResNet block is a block that implements a Residual Network (ResNet). The ResNet blocks shown in FIG. 4 are a known ResNet modified to suit the purpose of the present invention. In one embodiment, a plurality of ResNet blocks may be connected in series, so that the output of one ResNet block is the input of the next ResNet block in the series.

As shown in FIG. 4, the first artificial neural network 311 according to the present invention includes a first ResNet block 111, a second ResNet block 112, and a third ResNet block 113, that is, three ResNet blocks in total. Experiments showed that three ResNet blocks give the best output when both the accuracy of the output information and the efficiency of the process are considered, which is why the first artificial neural network 311 is shown with three ResNet blocks. The embodiments of the present invention are not limited to this, and the number and arrangement of ResNet blocks may be varied to suit the purpose of the invention.

Each ResNet block of the first artificial neural network 311 may be implemented as a modified Residual Network, as shown in FIG. 5. The first ResNet block 112 may receive the output of the convolution block 121 as its input and, after several network operations, output the 1-1 intermediate information 11.

Specifically, the second ResNet block 113 may receive as its input the 1-1 intermediate information 11 output by the first ResNet block 112, perform several network operations on it, and output the 2-1 intermediate information 12. The third ResNet block 114 may likewise receive the 2-1 intermediate information 12 as its input and output the 3-1 intermediate information 13 by the same process.

The first ResNet block 112 is described with reference to FIG. 5 (the second and third ResNet blocks have the same structure as the first ResNet block). As shown in FIG. 5, the first ResNet block 112 may include, in order, a BN/LReLU block 121, a MaxPool (1X4) block 122, a Conv 2D block 123, a BN/LReLU block 124, and a Conv 2D block 125, and each block may perform the operation corresponding to its name. Operations such as BN/LReLU, MaxPool, and Conv 2D are known techniques, so a detailed description of them is omitted.

The first ResNet block 112 according to the present invention differs from the known ResNet as follows. The information output from the Conv 2D block 125 and the information output from the MaxPool (1X4) block 122 may be summed, and the sum may be output from the first ResNet block 112 as the 1-1 intermediate information 11. That is, if the information output from the MaxPool (1X4) block is called X information, the X information becomes Y information through the 2D convolution operation in the Conv 2D block, so the first intermediate information 11 may finally be implemented as the sum of the Y information and the Z information output from the Conv 2D block 325.
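
The skip connection described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it follows the first sentence of the description (MaxPool output added to the output of the last Conv 2D block), it assumes that the 1X4 pooling runs along the frequency axis of a time-by-frequency matrix, it omits the leading BN/LReLU step, and it stands in for the Conv 2D, BN/LReLU, Conv 2D chain with a caller-supplied function `branch` that must preserve the matrix shape. All names are hypothetical.

```python
from typing import Callable, List

Matrix = List[List[float]]  # rows = time frames, columns = frequency bins


def max_pool_1x4(x: Matrix) -> Matrix:
    """Max-pool each row in non-overlapping groups of 4 columns."""
    return [
        [max(row[i:i + 4]) for i in range(0, len(row) - len(row) % 4, 4)]
        for row in x
    ]


def residual_block(x: Matrix, branch: Callable[[Matrix], Matrix]) -> Matrix:
    """Return pooled input plus the branch output (element-wise sum)."""
    pooled = max_pool_1x4(x)      # output of the MaxPool (1X4) block
    transformed = branch(pooled)  # placeholder for Conv 2D -> BN/LReLU -> Conv 2D
    return [
        [p + t for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(pooled, transformed)
    ]


# Toy branch that doubles every value, used only to make the sum visible.
double = lambda m: [[2.0 * v for v in row] for row in m]
print(residual_block([[1.0, 5.0, 2.0, 3.0, 0.0, -1.0, 4.0, 2.0]], double))
# [[15.0, 12.0]]
```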

The third intermediate information 13 output from the third ResNet block 114 may pass through the pooling block 115 and be input to the first LSTM block 316. The first LSTM block is a block implemented as an LSTM (long short-term memory) network, which is a kind of RNN. Because audio input is continuous in nature, an LSTM network, which makes use of results from earlier data, has the advantage of outputting pitch or interval information for the input audio data effectively. In the first artificial neural network 311 shown in FIG. 4, audio data corresponding to 31 frames is used as the first input information 10, so the first LSTM block 316 may also be implemented as a network composed of 31 layers.

The first output information 20, which is the information output by the first LSTM block 316, may include pitch information for the input audio data.

Specifically, the first output information 20 may give the pitch of the voice for each input frame. The first output information 20 may include 1-1 output information (Omv), which indicates whether a human vocal is present in the audio data at all, and 1-2 output information (Om), which includes information on whether a vocal is present and, if a vocal is present, its pitch.

For example, as shown in FIG. 4, when a vocal is present in the first and second frames and absent in the third frame, the 1-1 output information (Omv) is ON for the first and second frames and OFF for the third frame. In this case, the 1-1 output information is therefore output as V (Vocal) for the first and second frames and as NV (Non Vocal) for the third frame.

The 1-2 output information (Om), which concerns pitch, may include, as described above, the ON/OFF information of the vocal and, if a vocal is present, a numerical value for the pitch in each frame. In this case, the first and second frames carry a value corresponding to the pitch of the vocal, and the third frame, in which no vocal is present, may be output as 0.
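
The relation between the two outputs in this three-frame example can be shown with a small sketch. In the specification both Omv and Om are produced by the network; deriving the voicing flags from Om with a simple comparison, as done here, is only an illustrative assumption, as are the numeric pitch values and the name `voicing_flags`.

```python
from typing import List


def voicing_flags(om: List[float]) -> List[str]:
    """Map per-frame pitch values to "V" (vocal) or "NV" (non-vocal).

    A frame whose pitch value is 0 is treated as having no vocal,
    matching the convention that Om is output as 0 for such frames.
    """
    return ["V" if value > 0 else "NV" for value in om]


om = [62.0, 64.0, 0.0]    # hypothetical pitch values for frames 1 to 3
print(voicing_flags(om))  # ['V', 'V', 'NV']
```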

The first artificial neural network 311 included in the first artificial neural network module 310 has been described in detail above. The second artificial neural network 312, which is also included in the first artificial neural network module 310 and is connected in parallel with the first artificial neural network 311, is described in detail below.

FIG. 6 is a diagram showing input information and output information of the second artificial neural network according to an embodiment of the present invention, and FIG. 7 is a diagram showing the configuration of the first artificial neural network module, which includes the first artificial neural network and the second artificial neural network, according to an embodiment of the present invention.

The second artificial neural network 312 is described in detail with reference to FIGS. 6 and 7. The second artificial neural network 312 according to an embodiment is a pre-trained artificial neural network module that takes as its input second input information 30, obtained by summing several pieces of intermediate information output within the first artificial neural network 311, and outputs second output information 40, which contains vocal information indicating, frame by frame, whether a vocal is present in the audio data contained in the first input information 10.

Accordingly, although not shown in the drawings, the second artificial neural network 312 may include an inference session (not shown) that infers the second output information 40 from the second input information 30, and a training session (not shown) that adjusts and updates the parameters of the second artificial neural network 312 so as to increase the accuracy of the second output information 40, based on the second input information 30, the second output information 40, and second reference data.

Looking at the structure of the second artificial neural network 312 in detail, as shown in FIG. 7, the second artificial neural network 312 may include a convolution block 121 that receives the second input information 30 and a second LSTM block 122 that outputs the second output information 40.

Specifically, the second input information 30 may be the sum of the intermediate information output from several blocks of the first artificial neural network 311. It may be implemented as the sum of the first intermediate information 11 output from the first ResNet block 112, the second intermediate information 12 output from the second ResNet block 113, the third intermediate information 13 output from the third ResNet block 114, and the fourth intermediate information 14 output from the pooling block 115.

In another embodiment, the second input information 30 may be implemented as information that has passed through a plurality of max-pooling blocks. Specifically, the second input information 30 may be the sum of the 1-2 intermediate information 21, generated by passing the 1-1 intermediate information 11 through a first MP block 131 (a first max-pooling block); the 2-2 intermediate information 22, generated by passing the 2-1 intermediate information 12 through a second MP block 132 (a second max-pooling block); the 3-2 intermediate information 23, generated by passing the 3-1 intermediate information 13 through a third MP block 133 (a third max-pooling block); and the fourth intermediate information 14, which has passed through the pooling block 115. The second input information 30 formed in this way may pass through the convolution block 121, be input to the second LSTM block 122, and finally be output as the second output information 40.
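
The construction of the second input information from max-pooled intermediate features can be sketched as below. This is a simplified illustration under assumptions that are not in the specification: each intermediate feature is reduced to a one-dimensional list, each earlier feature is an integer multiple of the length of the fourth intermediate information, and the pooling factor of each MP block is chosen so that all four features end up the same length before they are summed. The names `max_pool` and `build_second_input` are hypothetical.

```python
from typing import List, Sequence


def max_pool(vec: Sequence[float], factor: int) -> List[float]:
    """Non-overlapping 1-D max pooling with the given factor."""
    return [max(vec[i:i + factor]) for i in range(0, len(vec), factor)]


def build_second_input(f1: Sequence[float], f2: Sequence[float],
                       f3: Sequence[float], f4: Sequence[float]) -> List[float]:
    """Pool f1..f3 down to the length of f4, then sum all four element-wise."""
    target = len(f4)
    pooled = []
    for feature in (f1, f2, f3):
        if len(feature) % target != 0:
            raise ValueError("feature length must be a multiple of len(f4)")
        pooled.append(max_pool(feature, len(feature) // target))
    pooled.append(list(f4))
    return [sum(values) for values in zip(*pooled)]


second_input = build_second_input(
    [1, 2, 3, 4, 5, 6, 7, 8],  # stands in for intermediate information 11
    [1, 3, 2, 4],              # stands in for intermediate information 12
    [5, 6],                    # stands in for intermediate information 13
    [10, 20],                  # stands in for intermediate information 14
)
print(second_input)  # [22, 38]
```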

For convenience of description, FIG. 6 shows the second artificial neural network 312 as including only the convolution block 121 and the second LSTM block 122. Depending on the embodiment, however, the second artificial neural network 312 may include the plurality of MP blocks shown in FIG. 6, and the number of MP blocks may correspond to the number of ResNet blocks included in the first artificial neural network 311.

The second output information 40, which is the information output by the second LSTM block 122, may include vocal information for the input audio data.

Specifically, for the audio data contained in the first input information 10, the second output information 40 may include, for each frame, either vocal information (V), indicating that a vocal is present, or non-vocal information (NV), indicating that no vocal is present.

The second output information 40 output by the second artificial neural network 312 is, in effect, already contained in the first output information output by the first artificial neural network 311. The difference is one of focus: the first artificial neural network 311 is focused on outputting pitch information, that is, the pitch of the vocal in the audio data, whereas the second artificial neural network 312 is focused on outputting which sections of the audio data contain a vocal and which do not. For this reason, the second artificial neural network 312 may also be referred to as a vocal information output artificial neural network.

FIG. 8 is a diagram showing a first artificial neural network module, including a first artificial neural network and a second artificial neural network, according to another embodiment of the present invention.

The basic configuration of the first artificial neural network module according to FIG. 8 is the same as that described with reference to FIG. 7, but the configuration of the first artificial neural network 311 differs.

Specifically, as shown in the drawing, the first artificial neural network 311 according to FIG. 8 may include, at the output stage of the first artificial neural network 311 according to FIG. 7, a summing block 317 that takes as its input the sum of the information output from the first LSTM block 116 and the information output from the second LSTM block 122, and outputs the final information of the first artificial neural network 311. The summing block 317 consists of a dense layer and an FC layer. It sums the information output by the first LSTM block 316 and the information output by the second LSTM block 122, performs a network operation on the summed information, and outputs the first output information 20.

In addition, although not shown in the drawing, the information output from the summing block 317 may be fed back into the second artificial neural network and used by the second artificial neural network 312 to output vocal information. Specifically, the second artificial neural network 312 may further include a summing block connected in series with the second LSTM block 122. This block sums the information output from the second LSTM block 122 and the information output from the summing block 317, performs a network operation on the summed information, and outputs the second output information 40 as the final output.

With the configuration shown in FIG. 8, the vocal information output by the second artificial neural network 312 can be used once more as intermediate input by the first artificial neural network 311, which outputs pitch information, so vocal information can be clearly distinguished in the input audio data. Using this information has the advantage that pitch information for the vocal can be output more clearly.

Likewise, the pitch information output by the first artificial neural network 311 can be used once more as intermediate input by the second artificial neural network 312, which outputs vocal information, so vocal information can be distinguished more accurately in the input audio data.

The specific configuration and process of the first artificial neural network module, which corresponds to the music transcription artificial neural network module according to the present invention, have been described above. The training method of the first artificial neural network module is described below.

The first artificial neural network 311 and the second artificial neural network 312 according to the present invention infer pitch information and vocal information, respectively, and each may be trained on its own. That is, the first artificial neural network 311 may compute a loss function based only on pitch information and then be trained, using the first reference data, so as to increase the accuracy of the pitch information. The second artificial neural network 312 may compute a loss function based only on vocal information and then be trained, using the second reference data, so as to increase the accuracy of the vocal information.

Alternatively, rather than training the first artificial neural network 311 and the second artificial neural network 312 independently, the first artificial neural network module 310 according to the present invention may combine the output information of the two networks and calculate the loss function on that basis. This is described in detail with reference to FIG. 9.

FIG. 9 is a diagram showing a method of calculating an integrated loss function of the music transcription artificial neural network module according to the present invention.

Referring to FIG. 9, the first loss function (Lpitch) of the first artificial neural network 311 according to the present invention may be constructed based only on the information output from the first artificial neural network 311. That is, as shown on the left side of FIG. 9, information (a1) indicating that pitch information is present or information (a2) indicating that pitch information is absent is generated for each frame, and the first loss function (Lpitch) is generated based on the difference between the generated information and the first reference information.

The second loss function (LVocal) of the second artificial neural network 312 may be generated based on vocal information and melody information. That is, as shown on the right side of FIG. 9, first vocal information (V1), which is the information about the section with pitch information (a1) among the components of the first loss function, is combined with first non-vocal information (NV1), which is the information about the section without pitch information (a2, in effect a section without vocals). Then second vocal information (V2), which is the information about the section in which vocal information exists among the information output from the second artificial neural network 312, and second non-vocal information (NV2), which is the information about the section in which no vocal information exists, are added to them to generate the second loss function (LVocal).

The loss function of the entire artificial neural network module, the total loss function Ltotal, may then be expressed as the sum of the first loss function (Lpitch) and the weighted second loss function (LVocal), specifically as in Equation (1) below.

Equation (1): Ltotal = Lpitch + (a * LVocal)

In Equation (1), a is a coefficient; in the present invention, 0.1, 0.5, or 1 may be used.
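The following is a minimal sketch of how the total loss could be computed, reading Equation (1) as the sum the text describes. Several details are assumptions made only for illustration and are not stated in the specification: cross-entropy is used as the per-frame loss, index 0 of each pitch distribution is treated as the "no pitch" class (a2), and the combined vocal estimate averages (rather than plainly sums) the two sources so that it remains a probability distribution. The function names are hypothetical.

```python
import math

def cross_entropy(probs, target_index, eps=1e-12):
    """Negative log-likelihood of the reference class for one frame."""
    return -math.log(max(probs[target_index], eps))

def total_loss(pitch_probs, pitch_targets, vocal_probs, vocal_targets, a=0.5):
    """Return (L_pitch, L_vocal, L_total), each averaged over frames.

    pitch_probs: per-frame distributions over pitch classes; by assumption,
                 index 0 is "no pitch" (a2) and the rest are pitches (a1).
    vocal_probs: per-frame [non_vocal, vocal] distributions from the
                 second network.
    """
    n = len(pitch_probs)
    l_pitch = sum(cross_entropy(p, t)
                  for p, t in zip(pitch_probs, pitch_targets)) / n

    l_vocal = 0.0
    for p, v, t in zip(pitch_probs, vocal_probs, vocal_targets):
        nv1, v1 = p[0], sum(p[1:])   # voicing implied by the pitch output
        nv2, v2 = v[0], v[1]         # voicing from the vocal detector
        # Assumption: average the two sources so the result sums to 1.
        combined = [(nv1 + nv2) / 2.0, (v1 + v2) / 2.0]
        l_vocal += cross_entropy(combined, t)
    l_vocal /= n

    # Equation (1): L_total = L_pitch + a * L_vocal
    return l_pitch, l_vocal, l_pitch + a * l_vocal
```

With a smaller coefficient such as a = 0.1, the vocal term contributes less to the parameter update than with a = 1, which is the trade-off the coefficient controls.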

Because the loss function generated in this way adjusts the parameters of the entire artificial neural network in consideration of both vocal information and pitch information, it has the advantage that vocal information and pitch information can be output more accurately.

So far, the first artificial neural network module that outputs vocal information and pitch information has been described in detail. Next, as another embodiment of the present invention, a second artificial neural network module that uses the output information of the first artificial neural network module to output note-level transcription information is described in detail.

FIG. 10 is a diagram showing the components of an automatic music transcription device using an artificial neural network according to an embodiment, and FIG. 11 is a diagram for explaining the preprocessing module of the automatic music transcription device according to an embodiment.

Referring to FIG. 10, the automatic music transcription device according to an embodiment may include: a first artificial neural network module 310 that outputs frame-level pitch information for audio data; a preprocessing module 340 that generates first learning data by converting the pitch information output by the first artificial neural network module 310 into note-level pitch information for the audio data; a first memory module 210 that stores the first learning data; a second memory module 220 that stores second learning data labeled with note-level pitch information; a second artificial neural network module 320 that is trained on at least one of the first learning data and the second learning data and outputs frame-level pitch information for input frame-level audio data; and a post-processing module 350 that converts the frame-level information output from the second artificial neural network module into note-level information.

In the automatic music transcription device of FIG. 10, the first artificial neural network module 310 and the second artificial neural network module 320 are largely the same in configuration as the artificial neural network modules described with reference to the preceding figures. In one embodiment, the artificial neural network module of FIG. 7 may be adopted as the first artificial neural network module 310 and the artificial neural network module of FIG. 8 as the second artificial neural network module 320. In this case, the first artificial neural network module 310 serves as the network that generates the learning data needed to train the second artificial neural network module 320, and the second artificial neural network module 320 performs the actual automatic vocal transcription.

Overlapping descriptions of the first artificial neural network module 310 and the second artificial neural network module 320 are omitted. Turning to the other components and the overall process, the preprocessing module 340 may process the frame-level pitch information output by the first artificial neural network module 310 and convert it into pseudo-note-level pitch information.

Specifically, as shown in FIG. 10, the preprocessing module 340 converts frame-level pitch information into pseudo-note-level pitch information through two main processes: pitch quantization and rhythm quantization. The term pseudo-note level is used instead of note level because the data were not collected accurately at the note level from the outset but were derived by processing frame-level pitch information, and therefore differ from exact note-level data.

The pitch quantization process performed by the preprocessing module 340 rounds continuous pitch to semitone steps. Specifically, in many pitch estimation models based on artificial neural networks, the output is expressed as a softmax over quantized pitch values, where the spacing between adjacent pitch values tends to be much smaller than a semitone. Pitch quantization therefore takes the most confident pitch and quantizes it to the nearest MIDI note number.
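Pitch quantization as described above can be sketched as follows. The sketch assumes the standard MIDI convention (A4 = 440 Hz = note 69) and assumes that each frame's output is a probability list over frequency bins with one optional bin reserved for "no pitch"; the bin layout, the `unvoiced_index` parameter, and the function names are hypothetical and not taken from the specification.

```python
import math

def hz_to_midi(freq_hz):
    """Round a frequency in Hz to the nearest MIDI note number."""
    return int(round(69 + 12 * math.log2(freq_hz / 440.0)))

def quantize_pitch(frame_probs, bin_freqs_hz, unvoiced_index=None):
    """Pick the most confident bin per frame and snap it to a semitone.

    Returns one MIDI note number per frame, or None for a frame whose
    most confident bin is the (assumed) unvoiced bin.
    """
    notes = []
    for probs in frame_probs:
        best = max(range(len(probs)), key=probs.__getitem__)
        if best == unvoiced_index:
            notes.append(None)
        else:
            notes.append(hz_to_midi(bin_freqs_hz[best]))
    return notes
```

For instance, bins at 430 Hz, 440 Hz, and 450 Hz are all within half a semitone of A4, so each of them maps to MIDI note 69.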

Once pitch quantization is complete, the preprocessing module 340 performs rhythm quantization as the next step. Rhythm quantization snaps the segments of the quantized pitch line to beat-based units.

For example, the quantized pitch is smoothed with three median filters, whose sizes are set to 1/32, 1/16, and 1/12 of a beat, respectively, at the given tempo so that the filtered output is in beat-based units. The number of filters may vary depending on the usage environment, but it was found experimentally that three cascaded filters improved label quality the most.
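A minimal sketch of the cascaded median filtering is given below. It assumes the pitch line is a list of integer MIDI note numbers per frame with 0 standing for "no pitch", that the tempo in BPM and the frame rate are known, and that each window is rounded up to an odd number of frames; these conventions and all function names are illustrative assumptions rather than details from the specification.

```python
from statistics import median_low

def beats_to_frames(beat_fraction, tempo_bpm, frames_per_second):
    """Length of a fraction of a beat in frames, forced to an odd size >= 1."""
    seconds = beat_fraction * 60.0 / tempo_bpm
    size = max(1, int(round(seconds * frames_per_second)))
    return size if size % 2 == 1 else size + 1

def median_filter(values, size):
    """Sliding median with edge padding.

    median_low always returns a value that occurs in the window, so the
    output stays on the semitone grid.
    """
    if not values:
        return []
    half = size // 2
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    return [median_low(padded[i:i + size]) for i in range(len(values))]

def rhythm_quantize(midi_frames, tempo_bpm, frames_per_second,
                    beat_fractions=(1 / 32, 1 / 16, 1 / 12)):
    """Apply the median filters one after another (cascade)."""
    out = list(midi_frames)
    for fraction in beat_fractions:
        size = beats_to_frames(fraction, tempo_bpm, frames_per_second)
        out = median_filter(out, size)
    return out
```

At 120 BPM and 100 frames per second, the three windows come out as 3, 3, and 5 frames under these rounding assumptions, so a one-frame pitch spike is removed while a genuine note change is kept.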

Once rhythm quantization is complete, the preprocessing module 340 removes fragments that are too short to be regarded as singing and finally applies a simple rule that minimizes octave errors, which completes the preprocessing. The data generated in this way are stored in the first memory module 210 and can be used as learning data when the second artificial neural network module is trained.
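The removal of short fragments can be sketched as a grouping step that turns the per-frame pitch line into (onset, offset, note) events and drops events below a minimum duration. The 0-means-unvoiced convention, the `min_duration_s` threshold, and the function name are assumptions for illustration; the octave-error rule is not specified in the text and is therefore left out.

```python
def frames_to_notes(midi_frames, frames_per_second, min_duration_s=0.1):
    """Group runs of equal non-zero pitch into (onset_s, offset_s, midi_note).

    Runs shorter than min_duration_s are discarded as too short to be
    sung notes (the threshold value is an assumption).
    """
    notes = []
    start = 0
    for i in range(1, len(midi_frames) + 1):
        # Close the current run at the end of the data or on a pitch change.
        if i == len(midi_frames) or midi_frames[i] != midi_frames[start]:
            pitch = midi_frames[start]
            duration = (i - start) / frames_per_second
            if pitch != 0 and duration >= min_duration_s:
                notes.append((start / frames_per_second,
                              i / frames_per_second,
                              pitch))
            start = i
    return notes
```

The resulting tuples carry the onset, offset, and note information that a note-level label needs.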

The data stored in the second memory module 220 are data that include pitch information labeled at the note level.

The second artificial neural network module 320 may include a third artificial neural network that takes second audio data in the frequency domain as third input information and outputs, as third output information 60, pitch information including the vocal pitch for each frame of the second audio data, and a fourth artificial neural network that takes intermediate output information from the blocks constituting the third artificial neural network as fourth input information and outputs, as fourth output information, vocal information indicating the presence or absence of a vocal for each frame of the second audio data.

The third and fourth artificial neural networks of the second artificial neural network module 320 correspond to the first artificial neural network 311 and the second artificial neural network 312 of the first artificial neural network module 310, respectively, and most of their components are the same. However, as mentioned above, in one embodiment the artificial neural network module of FIG. 7 may be adopted as the first artificial neural network module 310 and the artificial neural network module of FIG. 8 as the second artificial neural network module 320.

As for the learning method of the second artificial neural network module 320, the module may be trained on at least one of the first learning data stored in the first memory module 210 and the second learning data stored in the second memory module 220.

In general, in vocal transcription, note-level transcription, which outputs transcription information as onset, offset, and note information, can convey more accurate transcription information than frame-level transcription in the frequency domain.

However, note-level transcription is difficult. By the nature of singing, different people sing the same note differently, and many other variables exist, so it is very hard to predict notes accurately with heuristic methods. Crucially, training an artificial neural network requires data labeled at the note level, but very little data is labeled at the level of a score that is accurately aligned with the audio, which makes training an artificial neural network very difficult.

However, by using the first artificial neural network module 310 and the preprocessing module 340, the automatic vocal transcription device according to the present invention can easily convert existing frame-level pitch information into pseudo-note-level pitch information whose character is similar to that of actual note-level data. Because the second artificial neural network module 320 can be trained on the data generated in this way, its output closely resembles note-level data, which improves the efficiency of the note-level automatic transcription device.

That is, the model of the second artificial neural network module 320 itself predicts results at the frame level, since its components closely resemble those of the first artificial neural network module 310. However, because it is trained on the pseudo-note-level first learning data and the note-level second learning data, and because time-series learning takes place in the LSTM stage that produces the output information, the final output is very similar in character to note-level information, in effect yielding note-level transcription information.

The post-processing module 350 may convert the frame-level output information from the second artificial neural network module 320 into note-level transcription information. In practice, the post-processing module 350 applies the same pitch quantization and rhythm quantization processes performed by the preprocessing module 340 described above to convert frame-level pitch information into note-level pitch information and output it. As these processes have been described in detail above, their description is omitted here.

FIG. 12 is a diagram illustrating the components of an automatic music transcription device using an artificial neural network according to another embodiment.

Referring to FIG. 12, the automatic music transcription device according to an embodiment may include: a first artificial neural network module 310 that outputs frame-level pitch information for audio data; a preprocessing module 340 that generates first learning data by converting the pitch information output by the first artificial neural network module 310 into note-level pitch information for the audio data; a first memory module 210 that stores the first learning data; a second memory module 220 that stores second learning data labeled with note-level pitch information; a second artificial neural network module 320 that is trained on at least one of the first learning data and the second learning data and outputs note-level pitch information for input frame-level audio data; a first post-processing module 351 that converts the frame-level information output from the second artificial neural network module into note-level information; a third memory module 230 that stores third learning data, which is random noise learning data; a third artificial neural network module 330 that is trained on at least one of the information output from the first post-processing module 351 and the third learning data and outputs note-level pitch information for input frame-level audio data; and a second post-processing module 352 that converts the frame-level information output from the third artificial neural network module 330 into note-level information.

In the automatic music transcription device of FIG. 12, the first artificial neural network module 310, the second artificial neural network module 320, and the third artificial neural network module 330 are largely the same in configuration as the artificial neural network modules described with reference to the preceding figures. In one embodiment, the artificial neural network module of FIG. 7 may be adopted as the first artificial neural network module 310, and the artificial neural network module of FIG. 8 as the second artificial neural network module 320 and the third artificial neural network module 330. In this case, the first artificial neural network module 310 serves as the network that generates the learning data needed to train the second artificial neural network module 320, and the second artificial neural network module 320 both performs the actual automatic vocal transcription and outputs the data on which the third artificial neural network module 330 is trained. That is, when the third artificial neural network module 330 is trained, the second artificial neural network module 320 acts as a teacher model that outputs more accurate information, and the third artificial neural network module 330 acts as a student model in that it learns from the information output by the second artificial neural network module 320. In other words, the automatic music transcription device of FIG. 12 is characterized in that the third artificial neural network module 330 is trained by a semi-supervised learning method.

Overlapping descriptions of the first artificial neural network module 310, the second artificial neural network module 320, and the preprocessing module 340 are omitted. The first post-processing module 351 and the second post-processing module 352 play the same role as the post-processing module 350 described above, so their detailed description is also omitted.

The third artificial neural network module 330 may include a fifth artificial neural network that takes third audio data in the frequency domain as fifth input information and outputs, as fifth output information, pitch information including the vocal pitch for each note of the third audio data, and a sixth artificial neural network that takes intermediate output information from the blocks constituting the fifth artificial neural network as sixth input information and outputs, as sixth output information, vocal information indicating the presence or absence of a vocal for each frame of the third audio data.

The fifth and sixth artificial neural networks of the third artificial neural network module 330 correspond to the third and fourth artificial neural networks of the second artificial neural network module 320, respectively, and most of their components are the same. However, as mentioned above, in one embodiment the artificial neural network module of FIG. 7 may be adopted as the first artificial neural network module 310, and the artificial neural network module of FIG. 8 as the second artificial neural network module 320 and the third artificial neural network module 330.

As for the learning method of the third artificial neural network module 330, the third artificial neural network module 330 is characterized in that it performs semi-supervised learning, with the second artificial neural network module 320 as the teacher, based on the third learning data stored in the third memory module 230 and the data output by the second artificial neural network module 320.

Semi-supervised learning according to the present invention can be carried out in several ways. In the first method, unlabeled data are input to both the trained second artificial neural network module 320 and the third artificial neural network module 330, the two outputs are compared, and training proceeds so as to reduce the difference between them.

In the second method, random noise data are mixed into the unlabeled data, the mixed data are input to both the trained second artificial neural network module 320 and the third artificial neural network module 330, the two outputs are compared, and training proceeds so as to reduce the difference between them.

In the third method, only the unlabeled data are input to the second artificial neural network module 320, while the unlabeled data mixed with random noise data are input to the third artificial neural network module 330; the two outputs are compared, and training proceeds so as to reduce the difference between them.

All three methods described above produced more accurate output than the prior art, and among them the third method, in which random noise data are added only to the input of the third artificial neural network module 330 in the student role, produced the most accurate results.
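The three teacher-student variants differ only in which model receives the noisy input, which the following toy sketch makes explicit. It is an illustration under stated assumptions, not the implementation of the specification: the models are plain Python callables, the noise is additive Gaussian, the discrepancy is measured by mean squared error, and all names are hypothetical.

```python
import random

def add_noise(features, scale, rng):
    """Mix additive Gaussian noise into a feature vector (assumed noise type)."""
    return [x + rng.gauss(0.0, scale) for x in features]

def consistency_loss(teacher, student, features, method,
                     noise_scale=0.1, rng=None):
    """Mean squared difference between teacher and student outputs.

    method 1: both models see the clean unlabeled input.
    method 2: both models see the same noisy input.
    method 3: the teacher sees clean input, only the student sees noise.
    """
    if method not in (1, 2, 3):
        raise ValueError("method must be 1, 2, or 3")
    rng = rng or random.Random(0)
    noisy = add_noise(features, noise_scale, rng)
    teacher_in = noisy if method == 2 else features
    student_in = features if method == 1 else noisy
    t_out = teacher(teacher_in)
    s_out = student(student_in)
    return sum((t - s) ** 2 for t, s in zip(t_out, s_out)) / len(t_out)
```

In an actual training loop, this value would be minimized with respect to the student's parameters only, while the already trained teacher stays fixed.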

So far, the method and apparatus for note-level automatic vocal transcription using an artificial neural network according to an embodiment have been described in detail with reference to the drawings.

The devices described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications running on the operating system.

The method according to the embodiment may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of examples and drawings, those of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in a different order from the described method, and/or the components of the described system, structure, device, circuit, and the like are combined in a different form from the described method, or are replaced or substituted by other components or equivalents. Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

100: user terminal 110: recording unit
120: communication unit 130: MIDI generation unit
140: display unit 200: music transcription device
300: processor 310: first artificial neural network module
320: second artificial neural network module 400: memory module
500: additional service module 600: user management module

Claims

A vocal transcription device that performs learning based on note-level audio data, comprising:
one or more processors; and
a memory module storing instructions executable by the one or more processors,
wherein the processor comprises:
a first artificial neural network that takes first audio data in the frequency domain as first input information and outputs, as first output information, pitch information including vocal pitch information in units of frames for the first audio data;
a pre-processing module that converts the first output information into first learning data including vocal information in units of notes;
a third artificial neural network that takes second audio data in the frequency domain as third input information and outputs, as third output information, pitch information including vocal pitch information in units of frames for the second audio data; and
a post-processing module that converts the third output information into vocal information in units of notes,
wherein the third artificial neural network is trained based on the first learning data.

The vocal transcription device of claim 1,
wherein the pre-processing module and the post-processing module convert pitch information in units of frames into note-level pitch information using a pitch quantization method and a rhythm quantization method.
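The conversion from frame-level pitch to note-level pitch can be illustrated with a minimal sketch. The claims do not specify the quantization algorithms, so the following assumes that pitch quantization means rounding a frequency to the nearest semitone (MIDI note number) and that rhythm quantization means snapping note boundaries to a fixed grid. The names `hz_to_midi`, `frames_to_notes`, and `frames_per_step` are hypothetical and do not come from the patent.

```python
import math


def hz_to_midi(f0_hz):
    """Pitch quantization (assumed): round a frequency in Hz to the nearest MIDI note."""
    return int(round(69 + 12 * math.log2(f0_hz / 440.0)))


def frames_to_notes(f0_frames, frames_per_step):
    """Convert frame-level pitch into note-level events.

    f0_frames: one frequency in Hz per frame; 0 or None marks a frame without vocals.
    frames_per_step: number of frames in one rhythmic grid step (an assumption).
    Returns a list of (onset_step, offset_step, midi_pitch) tuples.
    """
    notes = []
    current = None  # (start_frame, midi_pitch) of the note being built
    # A trailing unvoiced frame acts as a sentinel that flushes the last note.
    for i, f0 in enumerate(list(f0_frames) + [0.0]):
        midi = hz_to_midi(f0) if f0 and f0 > 0 else None
        if current is not None and midi != current[1]:
            start, pitch = current
            # Rhythm quantization (assumed): snap boundaries to the nearest grid step.
            onset = (start + frames_per_step // 2) // frames_per_step
            offset = (i + frames_per_step // 2) // frames_per_step
            if offset > onset:  # drop notes that collapse to zero length
                notes.append((onset, offset, pitch))
            current = None
        if midi is not None and current is None:
            current = (i, midi)
    return notes


if __name__ == "__main__":
    # Two unvoiced frames, a slightly wobbly A4, a gap, then a C4.
    f0 = [0.0, 0.0, 440.0, 442.0, 438.0, 441.0, 440.0, 439.0, 0.0, 0.0] + [261.6] * 4
    print(frames_to_notes(f0, frames_per_step=2))
```

In this sketch the small frame-to-frame deviations around 440 Hz collapse into a single note because they all round to the same semitone. A practical system would likely also need smoothing and a minimum-duration rule, which are omitted here.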

The vocal transcription device of claim 1, further comprising:
a first memory module that stores the first learning data; and
a second memory module that stores second learning data including pitch information labeled at the note level.

The vocal transcription device of claim 3,
wherein the third artificial neural network is trained based on the second learning data.

A vocal transcription device that performs learning based on note-level audio data, comprising:
one or more processors; and
a memory module storing instructions executable by the one or more processors,
wherein the processor comprises:
a first artificial neural network that takes first audio data in the frequency domain as first input information and outputs, as first output information, pitch information including vocal pitch information in units of frames for the first audio data;
a pre-processing module that converts the first output information into first learning data including vocal information in units of notes;
a third artificial neural network that takes second audio data in the frequency domain as third input information and outputs, as third output information, pitch information including vocal pitch information in units of frames for the second audio data;
a fifth artificial neural network that takes third audio data in the frequency domain as fifth input information and outputs, as fifth output information, pitch information including vocal pitch information in units of frames for the third audio data; and
a first post-processing module that converts the third output information into second learning data including vocal information in units of notes,
wherein the third artificial neural network is trained based on at least one of the second learning data and the third learning data.

The vocal transcription device of claim 5, further comprising
a second post-processing module that converts the fifth output information into vocal information in units of notes.

A learning method of a vocal transcription device that performs learning based on note-level audio data, using a processor that includes an artificial neural network, the method comprising:
a first output information outputting step in which the processor outputs first output information using a first artificial neural network that takes first audio data in the frequency domain as first input information and outputs, as the first output information, pitch information including vocal pitch information in units of frames for the first audio data;
a pre-processing step in which the processor converts the first output information into first learning data including vocal information in units of notes;
a learning step in which the processor trains a third artificial neural network based on the first learning data;
a third output information outputting step in which the processor outputs third output information using the third artificial neural network, which takes second audio data in the frequency domain as third input information and outputs, as the third output information, pitch information including vocal pitch information in units of frames for the second audio data; and
a post-processing step in which the processor converts the third output information into vocal information in units of notes.
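The order of the steps in the method claim can be sketched as a data flow. This is only an illustration under stated assumptions: the networks are replaced by toy stand-ins rather than real neural networks, and every name below (`toy_first_net`, `ThirdNetwork`, `frames_to_notes`, `make_first_learning_data`, `transcribe`) is hypothetical and not taken from the patent. The sketch shows only how the output of the first network becomes note-level training data for the third network, whose own output is then post-processed into notes.

```python
from typing import Callable, List, Tuple

Spectrogram = List[List[float]]  # frequency-domain audio, one row of bins per frame
FramePitch = List[int]           # one MIDI-like pitch per frame, 0 = no vocal
Note = Tuple[int, int, int]      # (onset_frame, offset_frame, pitch)


def frames_to_notes(frame_pitch: FramePitch) -> List[Note]:
    """Stand-in pre-/post-processing: merge runs of equal non-zero pitch into notes."""
    notes: List[Note] = []
    start = None
    for i, p in enumerate(frame_pitch + [0]):  # trailing 0 flushes the last note
        if start is not None and p != frame_pitch[start]:
            notes.append((start, i, frame_pitch[start]))
            start = None
        if p != 0 and start is None:
            start = i
    return notes


def toy_first_net(clip: Spectrogram) -> FramePitch:
    """Toy stand-in for the first network: strongest bin per frame, 0 if silent."""
    return [0 if max(frame) == 0 else 60 + frame.index(max(frame)) for frame in clip]


def make_first_learning_data(
    first_net: Callable[[Spectrogram], FramePitch], clips: List[Spectrogram]
) -> List[Tuple[Spectrogram, List[Note]]]:
    """First output step and pre-processing step: note-level labels from the first network."""
    return [(clip, frames_to_notes(first_net(clip))) for clip in clips]


class ThirdNetwork:
    """Placeholder for the third network; a real one would be a trainable model."""

    def __init__(self) -> None:
        self.examples_seen = 0

    def fit(self, dataset: List[Tuple[Spectrogram, List[Note]]]) -> None:
        # A real implementation would minimise a loss against the note labels.
        self.examples_seen += len(dataset)

    def predict_frames(self, clip: Spectrogram) -> FramePitch:
        return toy_first_net(clip)  # toy behaviour only, not a learned prediction


def transcribe(third_net: ThirdNetwork, clip: Spectrogram) -> List[Note]:
    """Third output step and post-processing step."""
    return frames_to_notes(third_net.predict_frames(clip))


if __name__ == "__main__":
    clip = [[0, 0, 0], [1, 0, 0], [1, 0, 0], [0, 0, 2], [0, 0, 0]]
    data = make_first_learning_data(toy_first_net, [clip])  # steps 1 and 2
    net = ThirdNetwork()
    net.fit(data)                                            # learning step
    print(transcribe(net, clip))                             # steps 4 and 5
```

Because the first network supplies the labels, the third network can be trained on audio that has no human note annotations, which is the role the claims give to the first learning data.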