KR20210107943A

KR20210107943A - Method and Device for Detecting Sound Source

Info

Publication number: KR20210107943A
Application number: KR1020200022114A
Authority: KR
Inventors: 김동원; 윤종길
Original assignee: 에스케이텔레콤 주식회사
Priority date: 2020-02-24
Filing date: 2020-02-24
Publication date: 2021-09-02
Also published as: KR102386186B1

Abstract

Disclosed are a method and a device for processing audio data. According to an aspect of the present invention, provided is a method for processing audio data comprising the steps of: converting input audio data and reference audio data respectively into an input spectrogram and a reference spectrogram; normalizing the reference spectrogram and detecting a target spectrogram, which is a region with the highest similarity with the normalized reference spectrogram in the input spectrogram; generating a power ratio vector, which is the ratio of the power of the target spectrogram to the reference spectrogram; adjusting the size of the reference spectrogram for each time component using the power ratio vector; generating a difference spectrogram based on the target spectrogram and the adjusted reference spectrogram; and generating output audio data based on the difference spectrogram. Therefore, the reference audio data can be detected accurately and deleted from synthesized audio data.

Description

Audio data processing apparatus and method {Method and Device for Detecting Sound Source}

Embodiments of the present invention relate to an audio data processing apparatus and method for removing specific audio data from audio data in which multiple audio data are mixed, and for adding other audio data.

The content described in this section merely provides background information on the present invention and does not constitute prior art.

When a video is produced, footage is shot with a plurality of cameras, and a title, logo, captions, background music (BGM), sound effects, and the like are then added to the footage. The original footage, subtitles, titles, logos, background music, and sound effects used for mastering are not stored separately because of storage space and similar constraints. As a result, it is not easy to change the background music or a voice included in the mastered audio data. In addition, when a mastered video is exported overseas, depending on the country, royalties may have to be paid because of the license of the background music (BGM) used in the video, or export may not be possible at all because of copyright issues.

Therefore, research is needed on techniques for removing only specific audio data from mastered audio data.

To this end, audio data processing methods using neural networks are being studied. One technique inputs synthesized audio data into a trained neural network and obtains a plurality of separated audio data. For example, when synthesized audio data in which a human voice and background music are combined is input to a trained neural network, audio data from which all sounds other than a specific person's voice have been removed can be obtained. Here, the neural network may be one of various neural networks, such as a convolutional neural network (CNN), a recurrent neural network (RNN), an autoencoder, a U-Net, a long short-term memory (LSTM) network, and a generative adversarial network (GAN).

In addition, methods for separating synthesized audio data in which several audio data are combined are being studied. These are divided into methods that separate the synthesized audio data in the time-frequency domain and methods that separate it in the time domain. To increase separation accuracy, separation in the time-frequency domain is the more widely used of the two. Specifically, the reference audio data can be removed from the synthesized audio data by converting the time-domain synthesized audio data and the reference audio data to be removed into the time-frequency domain, removing the reference audio data from the synthesized data, and converting the resulting audio data back into the time domain.

However, to remove the reference audio data from the synthesized audio data, the section of the synthesized audio data in which the reference audio data is used must be detected accurately, and only the portion highly related to the reference audio data must be removed. In addition, the magnitude of the reference audio data needs to be adjusted in consideration of the magnitude of the synthesized audio data. For example, when a fade-out (a change in volume over time) is applied to the section to be removed from the synthesized audio data, the change in volume must also be taken into account when processing the audio data.

In mastered synthesized audio data, voices, noise, and the like make it hard to detect changes in the volume of the background music used in the synthesized audio data. Conventionally, a person listens directly to the changes in the volume of the background music and determines the background music section, so the volume of the background music is difficult to measure accurately.

Therefore, research is needed on a method for removing reference audio data from synthesized audio data in consideration of the section and magnitude of the reference audio data to be removed.

A main object of embodiments of the present invention is to provide an audio data processing apparatus and method that remove reference audio data from synthesized audio data while minimizing the effect on other audio data, and that accurately detect and remove the reference audio data by taking into account the section in which the reference audio data is included and the magnitude of the reference audio data (including the change in volume over time).

An object of other embodiments of the present invention is to provide an audio data processing apparatus and method that, in order to synthesize other audio data into the synthesized audio data, use the information that was used to detect and remove the reference audio data from the synthesized audio data.

According to one aspect of the present invention, there is provided an audio data processing method for removing or replacing reference audio data in input audio data, the method comprising: converting input audio data and reference audio data, expressed in the time domain, into an input spectrogram and a reference spectrogram, respectively, expressed in the time-frequency domain; normalizing the reference spectrogram; detecting a target spectrogram, which is the region of the input spectrogram having the highest similarity to the normalized reference spectrogram; generating a power ratio vector, which is the ratio of the power of each time component of the target spectrogram to the power of each time component of the reference spectrogram; adjusting the magnitude of each time component of the reference spectrogram using the power ratio vector, in order to emphasize the portion of the reference audio data used in the input audio data; generating a difference spectrogram based on the target spectrogram and the adjusted reference spectrogram; and generating output audio data based on the difference spectrogram.

According to another aspect of the present embodiment, there is provided an audio data processing apparatus for removing or replacing reference audio data in input audio data, the apparatus comprising: at least one memory storing instructions; and at least one processor that processes audio data by executing at least one instruction stored in the memory, wherein the at least one memory is configured, through the at least one processor, to: convert input audio data and reference audio data, expressed in the time domain, into an input spectrogram and a reference spectrogram, respectively, expressed in the time-frequency domain; normalize the reference spectrogram; detect a target spectrogram, which is the region of the input spectrogram having the highest similarity to the normalized reference spectrogram; generate a power ratio vector, which is the ratio of the power of each time component of the target spectrogram to the power of each time component of the reference spectrogram; adjust the magnitude of each time component of the reference spectrogram using the power ratio vector, in order to emphasize the portion of the reference audio data used in the input audio data; generate a difference spectrogram based on the target spectrogram and the adjusted reference spectrogram; and generate output audio data based on the difference spectrogram.

As described above, according to an embodiment of the present invention, reference audio data can be accurately detected in synthesized audio data by taking into account the section in which the reference audio data is included and the magnitude of the reference audio data (including the change in volume over time), and the reference audio data can be removed while minimizing the effect on other audio data.

As described above, according to another embodiment of the present invention, other audio data can be optimally synthesized into the synthesized audio data by using the information that was used to detect and remove the reference audio data from the synthesized audio data.

FIG. 1 is a diagram illustrating a process of measuring the similarity between two pieces of data when the input audio data includes all of the reference audio data.
FIG. 2 is a diagram illustrating a process of measuring the similarity between two pieces of data when the input audio data includes only part of the reference audio data.
FIG. 3 is a diagram illustrating an audio data processing method according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a process of generating a power ratio vector according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating an audio data processing process according to an embodiment of the present invention.

Hereinafter, some embodiments of the present invention are described in detail with reference to exemplary drawings. In assigning reference numerals to the components of each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In addition, in describing the present invention, detailed descriptions of related known configurations or functions are omitted where they could obscure the gist of the present invention.

In describing the components of the present invention, terms such as first, second, A, B, (a), and (b) may be used. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components. Throughout the specification, when a part is said to "include" or "comprise" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "unit" and "module" in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, software, or a combination of hardware and software.

Hereinafter, audio data refers to data in which the amplitude of a sound wave over time is converted into a vector, and a spectrogram refers to two-dimensional data obtained by converting audio data into the time-frequency domain. The horizontal axis of a spectrogram represents time, and the vertical axis represents frequency. The term "element" of a spectrogram is used interchangeably with "bin".

FIGS. 1(a) and 1(b) are diagrams illustrating a process of measuring the similarity between two pieces of data when the input audio data includes all of the reference audio data, according to an embodiment of the present invention.

Referring to FIGS. 1(a) and 1(b), an input spectrogram 100, a target spectrogram 102, and a reference spectrogram 110 are shown.

The input spectrogram 100 is two-dimensional data obtained by converting input audio data, in which the reference audio data and other audio data are mixed, into the time-frequency domain.

The reference spectrogram 110 is data obtained by converting the reference audio data into the time-frequency domain; the reference audio data is what is needed to remove the portion of the input audio data that is similar to it. According to an embodiment of the present invention, the reference spectrogram 110 may be normalized.

The target spectrogram 102 is the region of the input spectrogram 100 having the highest similarity to the reference spectrogram 110.

To detect the target spectrogram 102, an audio data processing apparatus (not shown) aligns the first time component of the reference spectrogram 110 with the first time component of the input spectrogram 100 and obtains a first similarity (similarity_1) as the sum of the element-wise products of the input spectrogram 100 and the reference spectrogram 110.

The audio data processing apparatus shifts the reference spectrogram 110 one time component at a time and, at each position, calculates the similarity between the reference spectrogram 110 and the portion of the input spectrogram 100 that it overlaps.

FIG. 1(b) shows the position of the target spectrogram 102, the section in which the similarity between the input spectrogram 100 and the reference spectrogram 110 is highest. When the similarity between the target spectrogram 102 and the reference spectrogram 110 is called a second similarity (similarity_2), the second similarity is greater than the first similarity.

To obtain the similarity according to an embodiment of the present invention, the audio data processing apparatus may use at least one of Equation 1, Equation 2, and Equation 3.

In Equations 1, 2, and 3, S is the similarity, i is the index of each time component of the input spectrogram 100, t is the total time length of the reference spectrogram 110, f is the total frequency length of each time component of a spectrogram, isound is the input spectrogram 100, and rsound is the reference spectrogram 110.

The L2-norm vector, which holds the L2-norm of each time component of the reference spectrogram 110, is calculated through Equation 4.
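The sliding comparison described above can be sketched in a few lines. The sketch below implements only the plain sum of element-wise products named in the text; it is not claimed to match Equations 1 to 3, which are not reproduced here. The function names `similarity_at` and `best_offset`, and the `spectrogram[time][frequency]` list layout, are assumptions made for illustration.

```python
from typing import List

Spectrogram = List[List[float]]  # assumed layout: spectrogram[time][frequency]


def similarity_at(isound: Spectrogram, rsound: Spectrogram, i: int) -> float:
    """Sum of element-wise products with rsound aligned at time index i of isound."""
    t = len(rsound)        # total time length of the reference spectrogram
    f = len(rsound[0])     # total frequency length of each time component
    return sum(isound[i + j][k] * rsound[j][k] for j in range(t) for k in range(f))


def best_offset(isound: Spectrogram, rsound: Spectrogram) -> int:
    """Time index of isound at which the similarity to rsound is highest."""
    offsets = range(len(isound) - len(rsound) + 1)
    return max(offsets, key=lambda i: similarity_at(isound, rsound, i))


isound = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [0.0, 1.0]]
rsound = [[1.0, 2.0], [3.0, 4.0]]
print(similarity_at(isound, rsound, 0))  # similarity with both first time components aligned
print(best_offset(isound, rsound))       # start of the region playing the role of the target spectrogram
```

In this toy input the reference appears verbatim starting at the second time component, so the similarity there exceeds the similarity at the first alignment, mirroring the relation between similarity_2 and similarity_1.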

FIGS. 2(a) and 2(b) are diagrams illustrating a process of measuring the similarity between two pieces of data when the input audio data includes only part of the reference audio data, according to an embodiment of the present invention.

Referring to FIGS. 2(a) and 2(b), a zero-padded input spectrogram 200, a reference spectrogram 210, and a target spectrogram 202 are shown.

When the input audio data includes only part of the reference audio data, that is, when the input spectrogram contains only a partial section of the reference spectrogram, the audio data processing apparatus (not shown) zero-pads the input spectrogram to increase the accuracy of the similarity.

The audio data processing apparatus zero-pads the front and rear ends of the input spectrogram by the time length of the reference spectrogram 210. The frequency length of the zero padding, the frequency length of the input spectrogram, and the frequency length of the reference spectrogram 210 are the same.

The audio data processing apparatus determines, as the target spectrogram 202, the region of the zero-padded input spectrogram 200 in which the sum of the element-wise products with the reference spectrogram 210 is largest.

As shown in FIG. 2(b), when the entire section of the reference spectrogram 210 is not contained in the zero-padded input spectrogram 200, the audio data processing apparatus can improve the accuracy of the similarity by measuring the similarity between the zero-padded input spectrogram 200 and the reference spectrogram 210.
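The zero-padding step can be illustrated with a small sketch. It pads the input spectrogram with all-zero time components at both ends, by the time length of the reference spectrogram, and then picks the window with the largest sum of element-wise products. The names `zero_pad` and `detect_target` and the list-of-lists layout are assumptions for illustration, and the returned offset is expressed in padded coordinates.

```python
from typing import List, Tuple

Spectrogram = List[List[float]]  # assumed layout: spectrogram[time][frequency]


def zero_pad(isound: Spectrogram, pad_frames: int) -> Spectrogram:
    """Add pad_frames all-zero time components before and after isound."""
    n_freq = len(isound[0])
    front = [[0.0] * n_freq for _ in range(pad_frames)]
    back = [[0.0] * n_freq for _ in range(pad_frames)]
    return front + [list(frame) for frame in isound] + back


def detect_target(isound: Spectrogram, rsound: Spectrogram) -> Tuple[int, Spectrogram]:
    """Return (offset in the padded spectrogram, best-matching region)."""
    t, f = len(rsound), len(rsound[0])
    padded = zero_pad(isound, t)  # pad by the reference time length

    def score(i: int) -> float:
        return sum(padded[i + j][k] * rsound[j][k] for j in range(t) for k in range(f))

    best = max(range(len(padded) - t + 1), key=score)
    return best, padded[best:best + t]


# The input starts in the middle of the reference: only its last two frames are present.
rsound = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
isound = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [0.0, 0.0]]
offset, target = detect_target(isound, rsound)
print(offset, target)
```

Without the padding, no window of the input could line the reference's second and third time components up with the first two input frames, so the best available score would understate the match.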

FIG. 3 is a diagram illustrating an audio data processing method according to an embodiment of the present invention.

Referring to FIG. 3, input audio data 300, an input spectrogram 302, a target spectrogram 304, a target power vector 306, an adjusted target power vector 308, reference audio data 310, a reference spectrogram 314, a reference power vector 318, a power ratio vector 320, an adjusted reference spectrogram 330, and a difference spectrogram 340 are shown.

In the following, the input audio data 300, the input spectrogram 302, and the difference spectrogram 340 are described as having a time component length of 5, and the target spectrogram 304, the target power vector 306, the adjusted target power vector 308, the reference audio data 310, the reference spectrogram 314, the reference power vector 318, the power ratio vector 320, and the adjusted reference spectrogram 330 are described as having a time component length of 3. This is only an example, and the audio data processing method can be applied to data of various lengths. In FIG. 3, the bold lines indicate the part or all of the reference audio data that is used in the input audio data 300, and the dotted line indicates the whole of the input spectrogram 302. A time component is the set of elements corresponding to one time interval in a spectrogram, and the value of each element is the magnitude of a frequency component.

Referring again to FIG. 3, in order to remove or replace the portion of the input audio data 300 that contains the reference audio data 310, the audio data processing apparatus (not shown) converts the input audio data 300 and the reference audio data 310, expressed in the time domain, into the input spectrogram 302 and the reference spectrogram 314, expressed in the time-frequency domain. According to an embodiment of the present invention, various methods may be used for the conversion, such as the fast Fourier transform (FFT), the short-time Fourier transform (STFT), Chroma-STFT, and Chroma-CQT (constant-Q transform).
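As one concrete instance of the conversion step, the following sketch computes a magnitude spectrogram with a short-time Fourier transform. It is a dependency-free illustration under stated assumptions, not the embodiment's implementation: the Hann window, the frame length, the hop size, the naive DFT, and the function name `stft_magnitude` are all choices made here, and only magnitudes are kept, so the phase that a time-domain reconstruction would need is not handled.

```python
import cmath
import math
from typing import List


def stft_magnitude(samples: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Magnitude spectrogram[time][frequency] from a Hann-windowed naive DFT."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    spectrogram = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):  # non-negative frequencies only
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            bins.append(abs(acc))  # each element is the magnitude of a frequency component
        spectrogram.append(bins)  # one time component per frame
    return spectrogram


# A sine wave that completes exactly 4 cycles per 32-sample frame.
samples = [math.sin(2 * math.pi * 4 * n / 32) for n in range(64)]
spec = stft_magnitude(samples, frame_len=32, hop=16)
print(len(spec), len(spec[0]))  # number of time components, number of frequency bins
```

In practice a library routine built on the FFT would replace the inner loop; the naive DFT is used here only to keep the sketch self-contained.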

The audio data processing apparatus detects the target spectrogram 304, which is the region of the input spectrogram 302 having the highest similarity to the reference spectrogram 314. According to an embodiment of the present invention, the audio data processing apparatus may first obtain the similarity between the two audio data in the time domain and then use it to detect the target spectrogram 304.

According to an embodiment of the present invention, the audio data processing apparatus may normalize the reference spectrogram 314. The normalization is performed by dividing each time component of the reference spectrogram 314 by the corresponding element of the L2-norm vector calculated by Equation 4. The audio data processing apparatus may then detect the target spectrogram 304, which is the region of the input spectrogram 302 having the highest similarity to the normalized reference spectrogram. Here, the target spectrogram 304 is the region of the zero-padded input spectrogram in which the sum of the element-wise products with the reference spectrogram 314 is largest.
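The normalization described above can be sketched as follows, assuming the standard definition of the L2-norm (the square root of the sum of squared elements) for each time component. Equation 4 itself is not reproduced here, so this is an illustration of the described operation rather than a transcription of it. The function names and the handling of all-zero time components are assumptions.

```python
import math
from typing import List

Spectrogram = List[List[float]]  # assumed layout: spectrogram[time][frequency]


def l2_norm_vector(rsound: Spectrogram) -> List[float]:
    """One L2-norm per time component of the reference spectrogram."""
    return [math.sqrt(sum(v * v for v in frame)) for frame in rsound]


def normalize(rsound: Spectrogram) -> Spectrogram:
    """Divide every time component by its own L2-norm."""
    normalized = []
    for frame, norm in zip(rsound, l2_norm_vector(rsound)):
        # Assumption: an all-zero time component is left as zeros to avoid dividing by zero.
        normalized.append([v / norm if norm > 0 else 0.0 for v in frame])
    return normalized


rsound = [[3.0, 4.0], [0.0, 0.0], [0.0, 2.0]]
print(l2_norm_vector(rsound))
print(normalize(rsound))
```

After normalization every non-silent time component has unit length, so loud passages of the reference no longer dominate the similarity score simply because of their volume.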

The audio data processing apparatus generates the power ratio vector 320, which is the ratio of the power of each time component of the target spectrogram 304 to the power of each time component of the reference spectrogram 314. The process by which the audio data processing apparatus generates the power ratio vector 320 is described in detail with reference to FIG. 4. In the power ratio vector 320, the adjusted target power vector 308, and the adjusted reference spectrogram 330, an element whose value is 0 corresponds to a portion with low relevance to the reference audio data 310 and marks the boundary of the portion to be removed. This is also described with reference to FIG. 4.

The power ratio vector 320 is used to remove, from the bold-line portion of the target spectrogram 304, an amount corresponding to the bold-line portion of the reference spectrogram 314. The power ratio vector 320 contains not only the volume level of the reference audio data 310 as used in the input audio data 300 but also information on how that volume changes. As a result, even when the volume of the reference audio data 310 changes within the input audio data 300, as with a fade-out of background music, only the reference audio data 310 can be removed without affecting other audio. The power ratio vector 320 may also be used to adjust additional audio data to be added to the input audio data 300. For example, when the reference audio data 310 was mixed into the input audio data 300 with a fade-out, the same fade-out effect may be applied to the additional audio data. Meanwhile, according to an embodiment of the present invention, the normalized reference spectrogram may be used to generate the power ratio vector 320.

To emphasize the portion of the reference audio data 310 that is used in the input audio data 300 (the bold-line portion), the audio data processing apparatus adjusts the magnitude of each time component of the reference spectrogram 314 using the power ratio vector 320. The adjusted reference spectrogram 330 is the data used to remove, from the target spectrogram 304, the portion that is highly related to the reference spectrogram 314. Specifically, the audio data processing apparatus generates the adjusted reference spectrogram 330 by multiplying the reference spectrogram 314 by the power ratio vector 320 for each time component. In other words, each time component of the adjusted reference spectrogram 330 is the corresponding time component of the reference spectrogram 314 multiplied by the corresponding element of the power ratio vector 320.

The audio data processing apparatus generates the difference spectrogram 340 based on the target spectrogram 304 and the adjusted reference spectrogram 330. According to an embodiment of the present invention, the difference spectrogram 340 may be data obtained by subtracting the adjusted reference spectrogram 330 from the target spectrogram 304.
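The scaling and subtraction steps described above can be traced on a toy fade-out. The sketch below multiplies each time component of a reference spectrogram by the matching element of a power ratio vector and subtracts the result from a target spectrogram. The function names, the example values, and the assumption that magnitudes simply add when sources are mixed are illustrative choices, not details taken from the embodiment.

```python
from typing import List

Spectrogram = List[List[float]]  # assumed layout: spectrogram[time][frequency]


def adjust_reference(rsound: Spectrogram, power_ratio: List[float]) -> Spectrogram:
    """Scale each time component of rsound by the matching power ratio element."""
    if len(rsound) != len(power_ratio):
        raise ValueError("power_ratio needs one element per time component")
    return [[ratio * v for v in frame] for frame, ratio in zip(rsound, power_ratio)]


def difference_spectrogram(target: Spectrogram, adjusted: Spectrogram) -> Spectrogram:
    """Element-wise target minus adjusted reference."""
    return [
        [tv - av for tv, av in zip(t_frame, a_frame)]
        for t_frame, a_frame in zip(target, adjusted)
    ]


# Toy data: constant background music that fades out over three time components.
rsound = [[2.0, 4.0], [2.0, 4.0], [2.0, 4.0]]
power_ratio = [1.0, 0.5, 0.25]                   # hypothetical fade-out profile
voice = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # the audio that should survive
adjusted = adjust_reference(rsound, power_ratio)
# Idealized mix for the example: voice plus faded music, bin by bin.
target = [[v + a for v, a in zip(vf, af)] for vf, af in zip(voice, adjusted)]
print(difference_spectrogram(target, adjusted))
```

Because the reference is scaled per time component before the subtraction, the fade-out is followed frame by frame and the remaining values equal the voice component in this idealized case.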

The audio data processing apparatus generates output audio data based on the differential spectrogram 340. As a result, the output audio data is the input audio data 300 with the portion corresponding to the reference audio data 310 removed.

FIG. 4 is a flowchart illustrating a process of generating a power ratio vector according to an embodiment of the present invention.

Referring to FIGS. 3 and 4, the audio data processing apparatus adjusts the target spectrogram 304 through an element-wise product of the target spectrogram 304 and the normalized reference spectrogram (S400). This process reduces noise in the target spectrogram 304 that has low relevance to the normalized reference spectrogram. That is, the similarity between the target spectrogram 304 and the reference spectrogram 314 increases.

The audio data processing apparatus separately generates a target power vector 306 and a reference power vector 318, which represent the power for each time component of the adjusted target spectrogram and of the reference spectrogram, respectively (S402). The target power vector 306 and the reference power vector 318 are derived from the target spectrogram 304 and the reference spectrogram 314 through at least one of Equation 5 and Equation 6.

In Equations 5 and 6, ivolume denotes the target power vector 306, and rvolume denotes the reference power vector 318.
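
Equations 5 and 6 are not reproduced in this text, so the sketch below follows only the wording of claim 5: the target power vector is derived from the per-time-component sum of products of the target spectrogram and the reference spectrogram, and the reference power vector from the per-time-component sum of products of the reference spectrogram with itself. The helper name `frame_power`, the `spectrogram[t][f]` list layout, and the sample values are assumptions, and the actual equations may include further terms.

```python
def frame_power(spec_a, spec_b):
    """For each time frame, sum the products of matching frequency bins."""
    return [
        sum(a * b for a, b in zip(frame_a, frame_b))
        for frame_a, frame_b in zip(spec_a, spec_b)
    ]


# Hypothetical data: 2 time frames x 2 frequency bins.
target = [[1.0, 2.0], [0.0, 1.0]]
reference = [[2.0, 1.0], [1.0, 3.0]]

ivolume = frame_power(target, reference)     # target power vector (per claim 5 wording)
rvolume = frame_power(reference, reference)  # reference power vector (per claim 5 wording)
print(ivolume)  # [4.0, 3.0]
print(rvolume)  # [5.0, 10.0]
```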

In addition, the audio data processing apparatus according to an embodiment of the present invention may detect, in the target power vector 306, one or more sequences whose elements have a value smaller than a threshold value and equal to the value of at least one adjacent element, determine the longest of the one or more sequences, and substitute 0 for the values of all elements of the target power vector 306 other than that sequence and the elements having a value greater than the threshold value (S404, S406, S408). A portion whose value is 0 is a portion of the reference audio data 310 that has low relevance to the input audio data 300.
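
One possible reading of steps S404 to S408 is sketched below. It assumes that a "sequence" is a run of two or more consecutive elements with exactly equal values below the threshold, and that ties in length are resolved in favor of the earliest run; both are interpretations, and the function name `keep_longest_low_run` and the sample values are hypothetical.

```python
def keep_longest_low_run(power, threshold):
    """Zero every element except the longest below-threshold run of equal
    values and the elements strictly greater than the threshold."""
    runs = []  # (start, end_exclusive) of runs of equal, below-threshold values
    i, n = 0, len(power)
    while i < n:
        j = i + 1
        while j < n and power[j] == power[i]:
            j += 1
        if j - i >= 2 and power[i] < threshold:
            runs.append((i, j))
        i = j

    keep = set()
    if runs:
        start, end = max(runs, key=lambda run: run[1] - run[0])
        keep = set(range(start, end))

    # Elements equal to the threshold are neither "smaller" nor "greater",
    # so under this reading they are zeroed as well.
    return [v if (v > threshold or k in keep) else 0.0 for k, v in enumerate(power)]


target_power = [5.0, 0.2, 0.2, 0.2, 0.7, 0.1, 0.1, 6.0]
substituted = keep_longest_low_run(target_power, threshold=1.0)
print(substituted)  # [5.0, 0.2, 0.2, 0.2, 0.0, 0.0, 0.0, 6.0]
```

Exact floating-point equality is used here because the text speaks of elements having "the same value"; a practical implementation would likely compare within a tolerance.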

According to another embodiment of the present invention, when there are a plurality of sequences, the audio data processing apparatus may calculate the similarity between the spectrogram corresponding to each sequence and the input spectrogram, and then substitute 0 for the values of the elements other than the sequence corresponding to the spectrogram with the highest similarity.

According to another embodiment of the present invention, the audio data processing apparatus may substitute 0 for the value of each element of the target power vector 306 that is smaller than a threshold value.

The audio data processing apparatus determines the ratio of the reference power vector 318 to the substituted target power vector 308 as the power ratio vector 320 (S410). Specifically, each element of the power ratio vector 320 is the ratio of the magnitude of the corresponding element of the reference power vector 318 to the corresponding element of the target power vector 308.
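
An element-wise ratio of two power vectors can be sketched as follows. The sketch divides target power by reference power, which is the direction stated for step S506 and in claim 1; treating that as the intended direction here is an assumption. The guard against a zero denominator, the function name `power_ratio_vector`, and the sample values are also assumptions.

```python
def power_ratio_vector(target_power, reference_power, eps=1e-12):
    """Element-wise target_power / reference_power.

    Frames whose reference power is (near) zero get a ratio of 0.0 so that
    no division by zero occurs (an assumed convention).
    """
    if len(target_power) != len(reference_power):
        raise ValueError("power vectors must have the same length")
    return [
        t / r if abs(r) > eps else 0.0
        for t, r in zip(target_power, reference_power)
    ]


# Hypothetical values; the second target element was zeroed by the substitution step.
substituted_target_power = [4.0, 0.0, 2.5]
reference_power = [8.0, 5.0, 0.0]

ratio = power_ratio_vector(substituted_target_power, reference_power)
print(ratio)  # [0.5, 0.0, 0.0]
```

With this direction, a target element that was replaced by 0 yields a ratio of 0, so the corresponding time component of the reference spectrogram is scaled away entirely.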

When the audio data processing apparatus adjusts the reference spectrogram 314 using the power ratio vector 320, the thick-line portion of the reference audio data 310 used in the input audio data 300 can be accurately detected and removed. In addition, when the audio data processing apparatus adds other audio data to the output audio data, using the power ratio vector 320 allows that addition to be optimized.

FIG. 5 is a flowchart illustrating an audio data processing process according to an embodiment of the present invention.

Referring to FIG. 5, the audio data processing apparatus converts input audio data and reference audio data into an input spectrogram and a reference spectrogram, respectively (S500).

The audio data processing apparatus normalizes the reference spectrogram (S502).

The audio data processing apparatus detects a target spectrogram, which is the region of the input spectrogram having the highest similarity with the normalized reference spectrogram (S504).

The audio data processing apparatus generates a power ratio vector, which is the ratio of the power for each time component of the target spectrogram to the power for each time component of the reference spectrogram (S506).

To emphasize the portion of the reference audio data that is used in the input audio data, the audio data processing apparatus adjusts the magnitude of each time component of the reference spectrogram using the power ratio vector (S508).

The audio data processing apparatus generates a differential spectrogram based on the target spectrogram and the adjusted reference spectrogram (S510).

The audio data processing apparatus generates output audio data based on the differential spectrogram (S512).
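
The detection step S504 can be sketched along the lines of claim 2: the input spectrogram is zero-padded at both ends by the time length of the reference spectrogram, and the window with the largest sum of element-wise products with the normalized reference spectrogram is selected. The sketch takes an already normalized reference as input, since how the normalization is performed is not shown here; the function names, the `spectrogram[t][f]` list layout, and the sample values are assumptions.

```python
def zero_frames(count, bins):
    """Return `count` independent all-zero frames with `bins` frequency bins."""
    return [[0.0] * bins for _ in range(count)]


def detect_target(input_spec, norm_ref):
    """Return (start, target): the start frame in the unpadded input and the
    best-matching window. start can be negative when the best window
    overlaps the front padding."""
    ref_len, bins = len(norm_ref), len(norm_ref[0])
    padded = zero_frames(ref_len, bins) + input_spec + zero_frames(ref_len, bins)

    best_offset, best_score = 0, float("-inf")
    for offset in range(len(padded) - ref_len + 1):
        window = padded[offset:offset + ref_len]
        score = sum(
            p * r
            for p_frame, r_frame in zip(window, norm_ref)
            for p, r in zip(p_frame, r_frame)
        )
        if score > best_score:
            best_offset, best_score = offset, score

    return best_offset - ref_len, padded[best_offset:best_offset + ref_len]


# Hypothetical data: 4 input frames and 2 reference frames, 2 bins each.
input_spec = [[0.0, 0.0], [1.0, 2.0], [2.0, 1.0], [0.0, 0.0]]
norm_ref = [[0.5, 1.0], [1.0, 0.5]]

start, target = detect_target(input_spec, norm_ref)
print(start)   # 1
print(target)  # [[1.0, 2.0], [2.0, 1.0]]
```

The padding lets the window slide past both ends of the input, so a reference that is only partly present at the beginning or end of the input can still be matched.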

Hereinafter, a process in which the audio data processing apparatus according to an embodiment of the present invention adds other audio data to the input audio data using the power ratio vector will be described.

The audio data processing apparatus according to an embodiment of the present invention may use the power ratio vector to add other audio data to the input audio data from which the reference audio data has been removed.

The audio data processing apparatus converts additional audio data to be added to the input audio data into an additional spectrogram. The audio data processing apparatus then adjusts the additional spectrogram using the power ratio vector, so that the additional audio data blends naturally into the input audio data. The process of adjusting the additional spectrogram is similar to the process of generating the adjusted reference spectrogram 330 in FIG. 3. That is, the adjusted additional spectrogram is generated by multiplying the additional spectrogram and the power ratio vector time component by time component.

The audio data processing apparatus generates a summed spectrogram based on the differential spectrogram and the additional spectrogram.

The audio data processing apparatus converts the summed spectrogram into output audio data. In other words, the output audio data is the input audio data from which the reference audio data has been removed and into whose removed portion the additional audio data has been mixed.
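
The addition of other audio data can be sketched as two element-wise steps: scaling the additional spectrogram by the power ratio vector, then summing it with the differential spectrogram. The sketch assumes that the summation is a plain element-wise addition of equally sized spectrograms and that the adjusted additional spectrogram is the one being summed; the function name `mix_in_additional`, the `spectrogram[t][f]` list layout, and the sample values are hypothetical.

```python
def mix_in_additional(differential, additional, power_ratio):
    """Scale each frame of `additional` by power_ratio[t], then add it to
    `differential` element by element."""
    adjusted = [
        [value * r for value in frame]
        for frame, r in zip(additional, power_ratio)
    ]
    return [
        [d + a for d, a in zip(d_frame, a_frame)]
        for d_frame, a_frame in zip(differential, adjusted)
    ]


# Hypothetical data: 2 time frames x 2 frequency bins.
differential = [[0.5, 1.5], [2.0, 0.0]]
additional = [[2.0, 2.0], [4.0, 4.0]]
power_ratio = [1.0, 0.5]  # the removed audio was half as loud in frame 2

summed = mix_in_additional(differential, additional, power_ratio)
print(summed)  # [[2.5, 3.5], [4.0, 2.0]]
```

Reusing the same power ratio vector means the replacement audio inherits the volume contour, such as a fade-out, of the audio it replaces.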

Meanwhile, the audio data processing apparatus according to an embodiment of the present invention may pre-process or post-process audio data using a trained neural network. Specifically, noise in the output audio data may be reduced by inputting synthesized audio data to the trained neural network to obtain the input audio data, or by inputting the output audio data, from which the reference audio data has been removed, to the trained neural network.

Although FIGS. 4 and 5 describe processes S400 to S512 as being executed sequentially, this is merely an illustration of the technical idea of an embodiment of the present invention. In other words, a person of ordinary skill in the art to which an embodiment of the present invention pertains could, without departing from the essential characteristics of the embodiment, apply various modifications and variations, such as changing the order described in FIGS. 4 and 5 or executing one or more of processes S400 to S512 in parallel. FIGS. 4 and 5 are therefore not limited to a time-series order.

Meanwhile, the processes shown in FIGS. 4 and 5 can be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes all kinds of recording devices in which data readable by a computer system is stored. That is, the computer-readable recording medium may be a non-transitory medium such as a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device, and may further include a transitory medium such as a carrier wave (for example, transmission over the Internet) or a data transmission medium. In addition, the computer-readable recording medium may be distributed over network-connected computer systems so that the computer-readable code is stored and executed in a distributed manner.

An audio data processing apparatus according to an embodiment of the present invention includes at least one memory that stores instructions and at least one processor that processes audio data by executing at least one instruction stored in the memory, and the at least one memory may store instructions configured to cause the operations of FIGS. 1 to 5 to be performed through the at least one processor.

The above description merely illustrates the technical idea of the present embodiment, and a person of ordinary skill in the art to which the present embodiment pertains will be able to make various modifications and variations without departing from its essential characteristics. Accordingly, the present embodiments are intended to explain, not to limit, the technical idea of the present embodiment, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present embodiment should be interpreted according to the following claims, and all technical ideas within a scope equivalent thereto should be interpreted as falling within the scope of rights of the present embodiment.

302: input spectrogram 304: target spectrogram
314: reference spectrogram 320: power ratio vector
340: differential spectrogram

Claims

1. A method for processing audio data, comprising:
converting input audio data and reference audio data expressed in a time domain into an input spectrogram and a reference spectrogram expressed in a time-frequency domain, respectively;
normalizing the reference spectrogram;
detecting a target spectrogram that is a region of the input spectrogram having the highest similarity with the normalized reference spectrogram;
generating a power ratio vector that is a ratio of power for each time component of the target spectrogram to power for each time component of the reference spectrogram;
adjusting the magnitude of each time component of the reference spectrogram using the power ratio vector;
generating a differential spectrogram based on the target spectrogram and the adjusted reference spectrogram; and
generating output audio data based on the differential spectrogram.

2. The method of claim 1,
wherein detecting the target spectrogram comprises:
performing zero padding on the front and rear ends of the input spectrogram by the time length of the reference spectrogram; and
determining, as the target spectrogram, the region of the zero-padded input spectrogram in which the sum of the element-wise products with the normalized reference spectrogram is greatest.

3. The method of claim 1,
wherein generating the power ratio vector comprises:
adjusting the target spectrogram through an element-wise product of the target spectrogram and the normalized reference spectrogram;
separately generating a target power vector and a reference power vector indicating power for each time component of the adjusted target spectrogram and of the reference spectrogram, respectively;
substituting 0 for the value of each element of the target power vector that is smaller than a threshold value; and
determining a ratio of the reference power vector to the substituted target power vector as the power ratio vector.

4. The method of claim 1,
wherein generating the power ratio vector comprises:
adjusting the target spectrogram through an element-wise product of the target spectrogram and the normalized reference spectrogram;
separately generating a target power vector and a reference power vector indicating power for each time component of the adjusted target spectrogram and of the reference spectrogram, respectively;
detecting, in the target power vector, one or more sequences having a value smaller than a threshold value and having the same value as at least one adjacent element;
determining the one sequence having the longest length among the one or more sequences;
substituting 0 for the values of the elements of the target power vector other than the one sequence and the elements having a value greater than the threshold value; and
determining a ratio of the reference power vector to the substituted target power vector as the power ratio vector.

5. The method of claim 3 or 4,
wherein the target power vector is a vector derived from the sum, for each time component, of products of the target spectrogram and the reference spectrogram, and
the reference power vector is a vector derived from the sum, for each time component, of products of the reference spectrogram with the reference spectrogram.

6. The method of claim 1,
wherein generating the output audio data comprises:
converting additional audio data to be added to the input audio data into an additional spectrogram;
adjusting the additional spectrogram using the power ratio vector;
generating a summed spectrogram based on the differential spectrogram and the additional spectrogram; and
converting the summed spectrogram into the output audio data.

7. A computer-readable recording medium on which a program for causing a computer to execute the method of any one of claims 1 to 6 is recorded.