KR20220144117A

KR20220144117A - Apparatus and method for separating audio sources using DenseLSTM

Info

Publication number: KR20220144117A
Application number: KR1020210050370A
Authority: KR
Inventors: 권오욱; 허운행
Original assignee: 충북대학교 산학협력단
Priority date: 2021-04-19
Filing date: 2021-04-19
Publication date: 2022-10-26

Abstract

The present specification relates to an apparatus and method for separating an audio source. The method for separating an audio source according to one embodiment of the present specification comprises: a step of receiving training data comprising target data that contains no noise and noise data that contains noise; a step of training a target separation model by combining feature maps corresponding to each of a plurality of frequency bands of a training spectrogram based on the training data; and a step of extracting a target signal using an input signal and a mask filter output from the target separation model. The present invention can therefore provide improved audio source separation performance.

Description

Audio source separation apparatus and method using DenseLSTM

This specification relates to an audio source separation apparatus and method. More particularly, it relates to an apparatus and method for separating an audio source using DenseLSTM.

When an unwanted signal such as severe noise is mixed into a system that uses an acoustic signal as its target, system performance degrades. A person, however, can listen to several mixed signals at the same time in a real environment, focus attention on the desired signal, and hear it selectively. This phenomenon is called auditory scene analysis or the cocktail party effect.

However, this phenomenon is very difficult to implement on a computer. As the copyright market has grown in recent years, there is an increasing need for technology that settles copyright fees accurately, and advances in deep learning have increased the demand for speech recognition technology.

Accordingly, there is a growing need for technology that uses a computer to precisely separate a desired signal from a mixed signal.

An object of the present specification is to provide an audio source separation apparatus and method that achieve improved audio source separation performance by applying an LSTM block to exploit time series information.

Another object of the present specification is to provide an audio source separation apparatus and method that can enlarge the receptive field by using a dilated block containing a time dilated convolution and a frequency dilated convolution connected in parallel.

The objects of the present specification are not limited to those mentioned above. Other objects and advantages of the present specification that are not mentioned can be understood from the following description and will be understood more clearly from the embodiments of the present specification. It will also be readily apparent that the objects and advantages of the present specification can be realized by the means indicated in the claims and combinations thereof.

A method of separating an audio source according to an embodiment of the present specification includes: receiving training data that includes target data containing no noise and noise data containing noise; training a target separation model by combining feature maps corresponding to each of a plurality of frequency bands of a training spectrogram based on the training data; and extracting a target signal using an input signal and a mask filter output from the target separation model.

In an embodiment of the present specification, the plurality of frequency bands includes a low band, a middle band, a high band, and a full band.

In an embodiment of the present specification, training the target separation model includes training the target separation model, based on the DenseNet structure, using at least one compression block, at least one DDB (Dilated Dense Block), and at least one LSTM block (Long Short-Term Memory block).

In an embodiment of the present specification, the DDB includes a dilated block and a dense block, and the dilated block includes an FD convolution (Frequency Dilated Convolution) and a TD convolution (Time Dilated Convolution) arranged in parallel.

In an embodiment of the present specification, training the target separation model includes: converting each of the target data and mixed data, in which the target data and the noise data are mixed, into a first training spectrogram and a second training spectrogram through a short-time Fourier transform (STFT); calculating the magnitude of each of the first and second training spectrograms using absolute values; and training the target separation model with the calculated magnitudes of the first and second training spectrograms as input.

In an embodiment of the present specification, extracting the target signal includes: converting the input signal into an input spectrogram through a short-time Fourier transform; calculating the magnitude of the input spectrogram, which is the absolute value of the input spectrogram; obtaining a mask filter from the target separation model using the magnitude of the input spectrogram; multiplying the input spectrogram by the mask filter and applying the phase information of the input spectrogram to generate a modified spectrogram; and extracting the target signal by performing an inverse short-time Fourier transform (ISTFT) on the generated modified spectrogram.

An audio source separation apparatus according to an embodiment of the present specification includes: a data input unit that receives training data including target data containing no noise and noise data containing noise; a model learning unit that trains a target separation model by combining feature maps corresponding to each of a plurality of frequency bands of a training spectrogram based on the training data; and a signal extraction unit that extracts a target signal using an input signal and a mask filter output from the target separation model.

In an embodiment of the present specification, the model learning unit trains the target separation model, based on the DenseNet structure, using at least one compression block, at least one DDB (Dilated Dense Block), and at least one LSTM block (Long Short-Term Memory block).

In an embodiment of the present specification, the DDB includes a dilated block and a dense block, and the dilated block includes an FD convolution (Frequency Dilated Convolution) and a TD convolution (Time Dilated Convolution) arranged in parallel.

In an embodiment of the present specification, the model learning unit converts each of the target data and mixed data, in which the target data and the noise data are mixed, into a first training spectrogram and a second training spectrogram through a short-time Fourier transform (STFT), calculates the magnitude of each of the first and second training spectrograms using absolute values, and trains the target separation model with the calculated magnitudes of the first and second training spectrograms as input.

In an embodiment of the present specification, the signal extraction unit converts the input signal into an input spectrogram through a short-time Fourier transform, calculates the magnitude of the input spectrogram, which is the absolute value of the input spectrogram, obtains a mask filter from the target separation model using the magnitude of the input spectrogram, multiplies the input spectrogram by the mask filter and applies the phase information of the input spectrogram to generate a modified spectrogram, and extracts the target signal by performing an inverse short-time Fourier transform (ISTFT) on the generated modified spectrogram.

The audio source separation apparatus and method according to an embodiment of the present specification can achieve improved audio source separation performance by applying an LSTM block to exploit time series information.

In addition, the audio source separation apparatus and method according to an embodiment of the present specification can enlarge the receptive field by using a dilated block containing a time dilated convolution and a frequency dilated convolution connected in parallel.

FIG. 1 is a configuration diagram of an audio source separation apparatus according to an embodiment of the present specification.
FIG. 2 is a block diagram showing the detailed configuration of the audio source separation apparatus.
FIG. 3 is a block diagram showing the target separation model of the present specification in detail.
FIG. 4 is a diagram showing the full band sub-model of the target separation model.
FIG. 5 is a diagram showing the internal structure of the DDB in detail.
FIG. 6 is a diagram showing the internal structure of the composite function.
FIG. 7 is a diagram showing the FD convolution and the TD convolution.
FIG. 8 is a diagram showing the internal structure of the compression block.
FIG. 9 is a diagram showing the internal structure of the LSTM block.

Since the present invention can be modified in various ways and can have various embodiments, specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope. In describing the drawings, like reference numerals are used for like elements.

Terms such as first, second, A, and B may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only to distinguish one element from another. For example, without departing from the scope of the present invention, a first element may be referred to as a second element, and similarly, a second element may be referred to as a first element. The term "and/or" includes a combination of a plurality of related listed items or any one of the plurality of related listed items.

When an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, but it should be understood that intervening elements may be present. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that no intervening elements are present.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In the present application, terms such as "comprise" or "have" are intended to indicate that a feature, number, step, operation, element, part, or combination thereof described in the specification is present, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and should not be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present application.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a configuration diagram of an audio source separation apparatus according to an embodiment of the present specification, and FIG. 2 is a block diagram showing the detailed configuration of the audio source separation apparatus. The following description refers to FIGS. 1 and 2.

Referring to the drawings, the audio source separation apparatus 100 includes a data input unit 110, a model learning unit 120, and a signal extraction unit 130.

The data input unit 110 receives the training data to be learned by the model learning unit 120 described below. Specifically, the training data includes target data (target dataset) 112 that contains no noise and noise data (non-target dataset) that contains noise. Here, there may be one set of target data and a plurality of sets of noise data. The training data may be sound data or speech data, but is not limited thereto and includes any audio data that contains sound.

The model learning unit 120 trains the target separation model based on the training data. Specifically, the model learning unit 120 converts each of the target data 112 and the mixed data 122, in which the target data and the noise data are mixed, into a first training spectrogram and a second training spectrogram through a short-time Fourier transform (STFT). Here, a spectrogram combines the characteristics of a waveform and a spectrum and is a tool for visualizing sound or waves; its X axis represents time and its Y axis represents frequency.

The model learning unit 120 then calculates the magnitude of each of the first and second training spectrograms using absolute values. The model learning unit 120 also trains the target separation model (separation model training) with the calculated magnitudes of the first and second training spectrograms as input, thereby generating the target separation model 124.
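
The preparation of the two training inputs can be illustrated with a minimal sketch. The frame length, hop size, Hann window, synthetic signals, and simple additive mixing are all assumptions made for illustration; the specification does not state these values, and the `stft` helper is hypothetical rather than part of the described apparatus.

```python
import numpy as np

def stft(signal, frame_len=256, hop=64):
    """Return a complex spectrogram of shape (frequency bins, time frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

rng = np.random.default_rng(0)
t = np.arange(4096) / 8000.0
target = np.sin(2 * np.pi * 440.0 * t)        # stand-in for the target data
noise = 0.3 * rng.standard_normal(t.shape)    # stand-in for the noise data
mixture = target + noise                      # mixed data (assumed additive)

# Magnitudes (absolute values) of the first and second training spectrograms.
first_mag = np.abs(stft(target))
second_mag = np.abs(stft(mixture))
```

Only the magnitudes are passed on as training inputs in this sketch; the phase is discarded at this stage.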

Using the target separation model 124 trained through this process, the target signal, that is, the signal to be extracted, can be extracted from an input signal as described below.

The signal extraction unit 130 extracts the target signal using the input signal and the mask filter output from the target separation model. Specifically, the signal extraction unit 130 converts the input signal into an input spectrogram through a short-time Fourier transform and calculates the magnitude of the input spectrogram, which is the absolute value of the input spectrogram.

Then, once the signal extraction unit 130 has obtained the mask filter 132 from the target separation model 124 using the calculated magnitude of the input spectrogram, it multiplies the input spectrogram by the mask filter and applies the phase information of the input spectrogram to generate a modified spectrogram.

Specifically, the signal extraction unit 130 uses the magnitude of the input spectrogram as the input of the target separation model 124 and obtains the mask filter 132 through the target separation model 124. The mask filter 132 is a filter that suppresses unwanted sounds in the input signal, such as white noise or pink noise. Applying the mask filter 132 to the input signal makes it easy to separate the target signal.

That is, the signal extraction unit 130 multiplies the input spectrogram converted from the input signal by the mask filter 132 to generate the modified spectrogram. Because the phase of the input signal is needed at this point, the phase information of the input spectrogram is applied together with the multiplication by the mask filter 132 when the modified spectrogram is generated.

The signal extraction unit 130 can extract the target signal by performing an inverse short-time Fourier transform (ISTFT) on the modified spectrogram generated in this way.
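
The extraction steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the apparatus itself: the `stft` and `istft` helpers, the frame length, the hop size, and the Hann window are hypothetical choices, and `mask_fn` is a placeholder standing in for the trained target separation model.

```python
import numpy as np

FRAME, HOP = 256, 64  # assumed analysis parameters

def stft(x):
    window = np.hanning(FRAME)
    n_frames = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[i * HOP:i * HOP + FRAME] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T          # (frequency, time)

def istft(spec):
    window = np.hanning(FRAME)
    frames = np.fft.irfft(spec.T, n=FRAME, axis=1)
    out = np.zeros(FRAME + HOP * (frames.shape[0] - 1))
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):            # weighted overlap-add
        out[i * HOP:i * HOP + FRAME] += frame * window
        norm[i * HOP:i * HOP + FRAME] += window ** 2
    return out / np.maximum(norm, 1e-8)

def extract_target(input_signal, mask_fn):
    spec = stft(input_signal)                     # input spectrogram
    magnitude = np.abs(spec)                      # magnitude fed to the model
    phase = np.angle(spec)                        # phase of the input
    mask = mask_fn(magnitude)                     # mask filter from the model
    modified = magnitude * mask * np.exp(1j * phase)
    return istft(modified)                        # target signal

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
# An all-ones mask is a trivial stand-in that passes everything through.
recovered = extract_target(x, lambda mag: np.ones_like(mag))
```

With the all-ones mask the output matches the input away from the signal edges, which is a convenient sanity check before a trained model is plugged in.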

FIG. 3 is a block diagram showing the target separation model of the present specification in detail, and FIG. 4 is a diagram showing the full band sub-model of the target separation model. The following description refers to FIGS. 3 and 4.

The model learning unit 120 trains the target separation model by combining feature maps corresponding to each of a plurality of frequency bands of a training spectrogram based on the training data.

Referring to FIG. 3, a training spectrogram 123 is used as input to the target separation model. The Y axis of the training spectrogram carries frequency values, and an acoustic signal generally shows patterns with different characteristics in each frequency band. The model learning unit 120 therefore divides the frequency range of the training spectrogram 123 into a low band, a middle band, and a high band, in order of increasing frequency. One training spectrogram 123 may thus include a plurality of frequency bands, and the plurality of frequency bands may be arranged in parallel. The plurality of frequency bands may include, for example, a low band, a middle band, a high band, and a full band.

By dividing one spectrogram 123 into a plurality of frequency bands as described above, the model learning unit 120 of the present specification can assign a sub-model suited to the characteristics of each frequency band and train the target separation model effectively.

In addition, the model learning unit 120 can remove the distortion that occurs at the boundaries between frequency bands by combining the low band, middle band, and high band together with the full band into one tensor.
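
The band division can be sketched as slicing the frequency axis of a magnitude spectrogram. The boundary indices `low_end` and `mid_end` and the function name `split_bands` are hypothetical; the specification does not state where the band boundaries lie.

```python
import numpy as np

def split_bands(magnitude, low_end, mid_end):
    """Split a (frequency, time) spectrogram into the four bands."""
    return {
        "low": magnitude[:low_end],
        "middle": magnitude[low_end:mid_end],
        "high": magnitude[mid_end:],
        "full": magnitude,                 # the full band keeps every bin
    }

spectrogram = np.arange(60.0).reshape(12, 5)   # 12 frequency bins, 5 frames
bands = split_bands(spectrogram, low_end=4, mid_end=8)
```

Each entry would then be passed to its own sub-model, with the full band entry covering the boundaries that the three partial bands cut across.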

Each frequency band has its own sub-model, and the model learning unit 120 outputs a feature map corresponding to each frequency band through the sub-models. The model learning unit 120 combines the output feature maps into one tensor 125 and outputs the mask filter 132. The output mask filter 132 may have a size corresponding to the size of the input training spectrogram 123.
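
One way to combine the per-band feature maps into a single tensor is sketched below. The layout is an assumption borrowed from common multi-band DenseNet practice: the low, middle, and high band outputs are stacked along the frequency axis and the result is concatenated with the full band output along the channel axis. The specification states only that the feature maps are combined into one tensor, so the function name and the layout are hypothetical.

```python
import numpy as np

def merge_band_features(low, middle, high, full):
    """Combine (channels, frequency, time) feature maps into one tensor."""
    banded = np.concatenate([low, middle, high], axis=1)   # along frequency
    return np.concatenate([banded, full], axis=0)          # along channels

channels, frames = 2, 5
low = np.zeros((channels, 4, frames))
middle = np.ones((channels, 4, frames))
high = np.full((channels, 4, frames), 2.0)
full = np.full((channels, 12, frames), 3.0)
tensor = merge_band_features(low, middle, high, full)
```

Because the frequency and time dimensions of the merged tensor equal those of the input spectrogram, a mask of matching size can be derived from it.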

Each sub-model, based on the DenseNet (densely connected convolutional networks) structure, includes at least one compression block, at least one DDB (Dilated Dense Block), and at least one LSTM block (Long Short-Term Memory block).

Here, the dense block is an image model block used in convolutional neural networks (CNNs), and the DDB is an extended form of the dense block that outputs feature maps from the training spectrogram. The internal structure of the DDB is described in detail below.

The compression block compresses the information in the feature maps output from the DDB and thereby reduces the number of feature maps. The LSTM block outputs and stores target information using time series information along the time axis of the training spectrogram.

Referring to FIG. 4, as an example of a sub-model, the full band sub-model includes 9 DDBs, 9 compression blocks, and 2 LSTM blocks. That is, the full band sub-model has a structure that obtains multi-scale features through down-sampling and up-sampling using a number of dense blocks. When a training spectrogram is input to the sub-model, the model learning unit 120 obtains as output a mask filter to be multiplied by the input spectrogram.

Specifically, a plurality of down-sampling steps and a plurality of up-sampling steps occur while the model learning unit 120 obtains the mask filter as output. Down-sampling is performed to obtain compressed information. Because signal separation is a regression task, the output of the network must have the same size as the input, so up-sampling is required after down-sampling. Here, down-sampling may use average pooling with a 2 x 2 kernel, and up-sampling may use a transposed convolution with a 2 x 2 kernel.
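
The size bookkeeping of the down-sampling and up-sampling steps can be sketched on a single-channel feature map. The sketch assumes a stride of 2 for both operations, even input dimensions, and a fixed all-ones kernel in place of learned transposed convolution weights; none of these details are given in the specification. With stride 2 and a 2 x 2 kernel the transposed convolution has no overlapping outputs, so it reduces to a Kronecker product.

```python
import numpy as np

def avg_pool_2x2(x):
    """2 x 2 average pooling with stride 2 on a (frequency, time) map."""
    f, t = x.shape
    return x.reshape(f // 2, 2, t // 2, 2).mean(axis=(1, 3))

def transposed_conv_2x2(x, kernel):
    """Stride-2 transposed convolution with a 2 x 2 kernel (single channel)."""
    return np.kron(x, kernel)

x = np.arange(16.0).reshape(4, 4)
pooled = avg_pool_2x2(x)                                   # shape (2, 2)
restored = transposed_conv_2x2(pooled, np.ones((2, 2)))    # shape (4, 4)
```

The restored map has the same size as the input, which is the property the regression setting requires, although its values are only a coarse version of the original.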

In an embodiment of the present specification, the model learning unit 120 outputs and stores target information using a total of six LSTM blocks in the down-sampling and up-sampling processes. Because the LSTM block outputs target information using time series information along the time axis of the training spectrogram, more detailed target information can be output.

FIG. 5 is a diagram showing the internal structure of the DDB in detail, FIG. 6 is a diagram showing the internal structure of the composite function, FIG. 7 is a diagram showing the FD convolution and the TD convolution, FIG. 8 is a diagram showing the internal structure of the compression block, and FIG. 9 is a diagram showing the internal structure of the LSTM block. The internal structures of the DDB and the LSTM block are described below with reference to FIGS. 5 to 9.

Referring to FIG. 5, the DDB includes a dilated block and a dense block. Here, the dense block may be composed of a plurality of composite functions (non-linear transformation functions). The dense block follows the DenseNet structure, in which the input and output feature maps of the plurality of composite functions are concatenated. With this DenseNet structure, the output of each function accumulates during the feed-forward pass, so that a large amount of information is retained, and the vanishing gradient problem in error back-propagation can be mitigated.

The DenseNet structure can be expressed by Equation 1 below.

<Equation 1>

$$x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}])$$

Here, $x_\ell$ is the output of the $\ell$-th layer and the input of the $(\ell+1)$-th layer, $H_\ell$ is the composite function, and $[x_0, x_1, \ldots, x_{\ell-1}]$ denotes the concatenation of the output feature maps from layer 0 to layer $\ell-1$.

As shown in FIG. 6, the composite function consists of batch normalization (BN), a rectified linear unit (ReLU), and a convolution with a 3 x 3 kernel, applied in sequence.
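
The dense connectivity and the composite function can be sketched together in NumPy. Several simplifications are assumed: normalization uses per-channel statistics of a single example with no learned scale or shift, the convolution weights are random rather than trained, zero padding keeps the map size unchanged, and the dilated block that precedes the dense block in the DDB is omitted. The function names are hypothetical.

```python
import numpy as np

def composite_function(x, weights):
    """BN -> ReLU -> 3 x 3 convolution on a (channels, freq, time) tensor."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    h = np.maximum((x - mean) / (std + 1e-5), 0.0)      # normalize, then ReLU
    padded = np.pad(h, ((0, 0), (1, 1), (1, 1)))        # zero "same" padding
    _, f, t = x.shape
    out = np.zeros((weights.shape[0], f, t))
    for df in range(3):
        for dt in range(3):
            patch = padded[:, df:df + f, dt:dt + t]
            out += np.einsum("oc,cft->oft", weights[:, :, df, dt], patch)
    return out

def dense_block(x0, num_layers, growth_rate, rng):
    """Each layer sees the concatenation of all earlier feature maps."""
    features = [x0]
    for _ in range(num_layers):
        stacked = np.concatenate(features, axis=0)
        w = 0.1 * rng.standard_normal((growth_rate, stacked.shape[0], 3, 3))
        features.append(composite_function(stacked, w))
    return np.concatenate(features, axis=0)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8, 6))     # 4 input feature maps
out = dense_block(x0, num_layers=3, growth_rate=2, rng=rng)
```

In this run the block turns 4 input maps into 10 output maps, since each of the 3 layers appends 2 new maps while the input maps are carried through unchanged.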

한편, Dilated block은 스펙트로그램에서 효과적으로 수용 범위를 늘리기 위한 것으로 Dense block 앞 단에 배치되어 Dense block과 함께 DDB를 형성한다. 스펙트로그램에서 시간축은 발화 속도에 영향을 받고, 주파수축은 성별에 따른 피치(pitch), 하모닉 등으로 시간축과 독립적인 원인으로 변화한다. On the other hand, the dilated block is to effectively increase the reception range in the spectrogram, and it is placed in front of the dense block to form a DDB together with the dense block. In the spectrogram, the time axis is affected by the firing rate, and the frequency axis changes with a cause independent of the time axis such as pitch and harmonic according to gender.

Therefore, referring again to FIG. 4, the dilated block has a structure in which BN and ReLU are followed by a frequency dilated convolution (FD Conv) and a time dilated convolution (TD Conv) arranged in parallel.

In addition, as shown in (a) of FIG. 7, the FD convolution has a 5 x 3 kernel with a dilation rate of [2, 1] and enlarges the receptive field along the frequency axis, and as shown in (b) of FIG. 7, the TD convolution has a 3 x 5 kernel with a dilation rate of [1, 2] and enlarges the receptive field along the time axis.
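As a worked check of how dilation enlarges the receptive field, the span covered by a dilated kernel along one axis is (kernel size - 1) x dilation + 1. The sketch below applies this standard formula to the two kernels described above. The (frequency, time) axis order is an assumption inferred from the statement that the 5 x 3 kernel with dilation [2, 1] widens the frequency axis; the helper names are illustrative only.

```python
def effective_extent(kernel_size: int, dilation: int) -> int:
    """Span covered along one axis by a dilated kernel: (k - 1) * d + 1."""
    return (kernel_size - 1) * dilation + 1


def conv_footprint(kernel, dilation):
    """Footprint of a 2-D dilated kernel, one entry per axis."""
    return tuple(effective_extent(k, d) for k, d in zip(kernel, dilation))


# Axis order (frequency, time) is an assumption inferred from the text.
fd_footprint = conv_footprint(kernel=(5, 3), dilation=(2, 1))
td_footprint = conv_footprint(kernel=(3, 5), dilation=(1, 2))
print(fd_footprint)  # (9, 3): 9 frequency bins, 3 time frames
print(td_footprint)  # (3, 9): 3 frequency bins, 9 time frames
```

Under this assumption each branch covers nine positions along its own axis with the same number of weights as an undilated 5 x 3 or 3 x 5 kernel.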

As described above, each layer of the DDB outputs k feature maps and all of them are connected, so the number of final output feature maps of the DDB, $m$, can be expressed by Equation 2 below.

<Equation 2>

$m = m_0 + L \times k$

Here, $m_0$ denotes the number of input feature maps, $L$ denotes the number of composite functions, and $k$ denotes the growth rate.

Meanwhile, the feature maps output from the DDB are compressed by the compression block, which reduces the number of feature maps. As shown in FIG. 8, the compression block consists of BN, ReLU, and a convolution with a 1 x 1 kernel. The compression rate of the compression block can be expressed as $\theta$, where $\theta$ has a value between 0 and 1. Accordingly, the number of final output feature maps that have passed through the compression block is $\theta m$, where $m$ is the number of feature maps output from the DDB.
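The feature-map bookkeeping for a DDB followed by a compression block can be sketched as follows. Two things here are assumptions rather than statements of the specification: that each composite function adds exactly k maps to the concatenation (the usual DenseNet count), and that a fractional result is rounded down. The numeric values are illustrative only.

```python
import math


def ddb_output_maps(m0: int, num_layers: int, growth_rate: int) -> int:
    """Feature maps after a dense block, assuming each layer adds growth_rate maps."""
    return m0 + num_layers * growth_rate


def compressed_maps(m: int, theta: float) -> int:
    """Feature maps after the compression block; floor rounding is an assumption."""
    if not 0.0 < theta <= 1.0:
        raise ValueError("theta must lie in (0, 1]")
    return math.floor(theta * m)


m = ddb_output_maps(m0=32, num_layers=4, growth_rate=12)  # illustrative values
print(m)                        # 80
print(compressed_maps(m, 0.5))  # 40
```

With a compression rate of 0.5, the 1 x 1 convolution would halve the number of feature maps passed on to the next stage.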

FIG. 9 shows the detailed structure of the LSTM block. The LSTM block consists of a convolution with a 1 x 1 kernel, a bi-directional LSTM (BLSTM), and a fully connected neural network (FCNN), and its input is connected to its output. Because the BLSTM is a recurrent neural network (RNN) structure, the LSTM block makes the model operate as a convolutional recurrent neural network (CRNN) rather than a convolutional neural network (CNN). In addition, through the BLSTM, the LSTM block outputs features of the feature maps that exploit time-series information along the time axis of the learning spectrogram.

In addition, an objective function is required to train the target separation model. The values of the parameters described above are learned iteratively so that the objective function approaches zero. The objective function can be expressed by Equation 3 below.

<Equation 3>

$J = \lVert Y - X \odot M \rVert_1$

Here, $J$ is the objective function, $Y$ is the magnitude of the spectrogram of the ground-truth signal, $X$ is the learning spectrogram, $M$ is the mask filter produced by the target separation model, $\odot$ is element-wise multiplication, and $\lVert \cdot \rVert_1$ denotes the sum of the absolute values of the elements of a matrix.
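To make the objective concrete, the sketch below evaluates the sum of absolute differences between the ground-truth magnitude Y and the masked input X multiplied element-wise by M, on a toy 2 x 2 example. It assumes that X is the magnitude spectrogram of the mixture and that no normalization factor is applied; the function name and all values are made up for illustration.

```python
def l1_mask_loss(Y, X, M):
    """Sum over all bins of |Y - X * M| with element-wise multiplication."""
    total = 0.0
    for y_row, x_row, m_row in zip(Y, X, M):
        for y, x, m in zip(y_row, x_row, m_row):
            total += abs(y - x * m)
    return total


# Toy 2 x 2 magnitude spectrograms (frequency x time); values are made up.
X = [[2.0, 4.0], [1.0, 3.0]]  # mixture magnitudes
Y = [[1.0, 4.0], [0.0, 1.5]]  # clean-target magnitudes
M = [[0.5, 1.0], [0.0, 0.5]]  # mask that reproduces Y exactly
loss_perfect = l1_mask_loss(Y, X, M)
loss_identity = l1_mask_loss(Y, X, [[1.0, 1.0], [1.0, 1.0]])
print(loss_perfect)   # 0.0
print(loss_identity)  # 3.5
```

A mask that recovers the target exactly drives the objective to zero, while an all-ones mask leaves the full noise contribution in the loss.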

FIG. 10 is a flowchart of an audio source separation method according to an embodiment of the present specification.

Referring to the drawing, the data input unit 110 of the audio source separation apparatus 100 receives training data including target data that contains no noise and noise data that contains noise (S110).

Thereafter, the model learning unit 120 trains the target separation model by combining the feature maps corresponding to each of a plurality of frequency bands for the learning spectrogram based on the training data (S120).

Here, the plurality of frequency bands includes a low band, a middle band, a high band, and a full band, and the target separation model is trained, based on the DenseNet structure, using at least one compression block, at least one dilated dense block (DDB), and at least one long short-term memory (LSTM) block.

When training of the target separation model is complete, the signal extraction unit 130 separates the audio source by extracting the target signal using the input signal and the mask filter output from the target separation model (S130).
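The core of step S130, as detailed in the claims, is to scale the magnitude of the input spectrogram by the mask filter while reusing the phase of the input spectrogram. The sketch below shows only that step on a toy complex spectrogram. The STFT and inverse STFT are omitted and would normally come from a signal-processing library; the function name and the values are hypothetical.

```python
import cmath


def apply_mask(spectrogram, mask):
    """Scale each bin's magnitude by the mask while keeping the input phase."""
    out = []
    for z_row, m_row in zip(spectrogram, mask):
        row = []
        for z, m in zip(z_row, m_row):
            magnitude = abs(z)      # magnitude of the input spectrogram
            phase = cmath.phase(z)  # phase information of the input
            row.append(cmath.rect(magnitude * m, phase))
        out.append(row)
    return out


# Toy complex spectrogram standing in for the STFT of the input signal.
S = [[3 + 4j, 1j], [-2 + 0j, 1 - 1j]]
mask = [[0.5, 1.0], [0.0, 2.0]]
S_hat = apply_mask(S, mask)
print([[abs(z) for z in row] for row in S_hat])
```

Applying an inverse STFT to the result of `apply_mask` would then yield the time-domain estimate of the target signal.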

Although the present invention has been described above with reference to the illustrated drawings, the present invention is not limited by the embodiments and drawings disclosed in this specification, and it is evident that various modifications can be made by those skilled in the art within the scope of the technical idea of the present invention. In addition, even where the operational effects of a configuration of the present invention were not explicitly stated in describing the embodiments, effects that are predictable from that configuration should naturally also be recognized.

Claims

1. An audio source separation method comprising:
receiving training data including target data that contains no noise and noise data that contains noise;
learning a target separation model by combining a feature map corresponding to each of a plurality of frequency bands with respect to a learning spectrogram based on the training data; and
extracting a target signal using an input signal and a mask filter output from the target separation model.

2. The method of claim 1,
wherein the plurality of frequency bands includes a low band, a middle band, a high band, and a full band.

3. The method of claim 1,
wherein learning the target separation model comprises learning the target separation model, based on a DenseNet structure, using at least one compression block, at least one dilated dense block (DDB), and at least one long short-term memory (LSTM) block.

4. The method of claim 3,
wherein the DDB includes a dilated block and a dense block, and
the dilated block includes a frequency dilated convolution (FD Convolution) and a time dilated convolution (TD Convolution) arranged in parallel.

5. The method of claim 1,
wherein learning the target separation model comprises:
transforming the target data and mixed data, in which the target data and the noise data are mixed, into a first learning spectrogram and a second learning spectrogram, respectively, through a short-time Fourier transform (STFT);
calculating magnitudes of the first and second learning spectrograms using absolute values; and
learning the target separation model using the calculated magnitudes of the first learning spectrogram and the second learning spectrogram as input.

6. The method of claim 1,
wherein extracting the target signal comprises:
converting the input signal into an input spectrogram through a short-time Fourier transform;
calculating a magnitude of the input spectrogram, which is the absolute value of the input spectrogram;
obtaining a mask filter from the target separation model using the magnitude of the input spectrogram;
generating a modified spectrogram by multiplying the input spectrogram by the mask filter and applying phase information of the input spectrogram; and
extracting the target signal by performing an inverse short-time Fourier transform (ISTFT) on the generated modified spectrogram.

7. An audio source separation apparatus comprising:
a data input unit for receiving training data including target data that contains no noise and noise data that contains noise;
a model learning unit for learning a target separation model by combining a feature map corresponding to each of a plurality of frequency bands with respect to a learning spectrogram based on the training data; and
a signal extraction unit for extracting a target signal using an input signal and a mask filter output from the target separation model.

8. The apparatus of claim 7,
wherein the plurality of frequency bands includes a low band, a middle band, a high band, and a full band.

9. The apparatus of claim 7,
wherein the model learning unit learns the target separation model, based on a DenseNet structure, using at least one compression block, at least one dilated dense block (DDB), and at least one long short-term memory (LSTM) block.

10. The apparatus of claim 9,
wherein the DDB includes a dilated block and a dense block, and
the dilated block includes a frequency dilated convolution (FD Convolution) and a time dilated convolution (TD Convolution) arranged in parallel.

11. The apparatus of claim 7,
wherein the model learning unit transforms the target data and mixed data, in which the target data and the noise data are mixed, into a first learning spectrogram and a second learning spectrogram, respectively, through a short-time Fourier transform (STFT), calculates the magnitude of each of the first learning spectrogram and the second learning spectrogram using absolute values, and learns the target separation model using the calculated magnitudes of the first learning spectrogram and the second learning spectrogram as input.

12. The apparatus of claim 7,
wherein the signal extraction unit converts the input signal into an input spectrogram through a short-time Fourier transform, calculates a magnitude of the input spectrogram, which is the absolute value of the input spectrogram, obtains a mask filter from the target separation model using the magnitude of the input spectrogram, generates a modified spectrogram by multiplying the input spectrogram by the mask filter and applying phase information of the input spectrogram, and extracts the target signal by performing an inverse short-time Fourier transform (ISTFT) on the generated modified spectrogram.