KR100978913B1

KR100978913B1 - A query-by-humming system using plural matching algorithms based on SVM

Info

Publication number: KR100978913B1
Application number: KR1020090133972A
Authority: KR
Inventors: 박성주; 송재종; 이석필; 박강령; 김재희; 남현하; 남기표; 누엉티투창; 이의철
Original assignee: 전자부품연구원
Priority date: 2009-12-30
Filing date: 2009-12-30
Publication date: 2010-08-31

Abstract

PURPOSE: A sound source retrieval system and method for query by humming that combine a plurality of matching algorithms using an SVM are provided. Matching scores obtained from the DTW (Dynamic Time Warping), EMD (Earth Mover's Distance), and LS (Linear Scaling) methods are combined to provide accurate retrieval results. CONSTITUTION: An audio signal analysis unit (120) analyzes an audio signal and extracts its feature values. An audio signal matching search unit (140) takes the feature values of the corrected audio signal as input, applies a first and a second matching algorithm, and produces a first and a second output value. The audio signal matching search unit then takes the first and second output values as input, produces a similarity measure, and retrieves from a sound source database the sound sources with high similarity according to that measure.

Description

A sound source retrieval system and method combining multiple SVM-based matching algorithms {A QUERY BY HUMMING SYSTEM USING PLURAL MATCHING ALGORITHM BASED ON SVM}

The present invention relates to a sound source retrieval system and method that combine a plurality of matching algorithms using an SVM. More specifically, it relates to a system and method that receive an audio signal such as humming, filter and normalize it, apply a plurality of different matching algorithms to various pitch characteristics of the normalized signal, combine the resulting matching scores through an SVM, and retrieve from a database the sound source that matches the input audio signal.

A conventional sound source retrieval method generates a symbol melody sequence by taking the relative value between successive notes of the input humming melody, measures its similarity to the feature values of the sound sources stored in a database using dynamic programming (DP) matching, and decides whether the match succeeded.

In this method, a melody similarity calculator computes the similarity between the input symbol melody sequence and the feature values of each audio item in the metadata DB. The similarity is obtained with DP matching, using a measure of the distance between two vectors such as the Euclidean distance or the sum of absolute differences. A distance-based classifier then sorts the computed distances in ascending order, extracts the corresponding metadata from the metadata DB, and delivers it to the client over the Internet as the search result.

Another conventional method sets breakpoints using the spectral energy distribution of the input melody and, using those breakpoints as references, converts the input melody into distinct notes. The confidence level is determined by the indicator value of the spectral energy distribution and by the energy level of the input melody. Matching is performed so as to minimize the error caused by inserting and deleting notes and the error associated with the confidence level.

The important features used in this method are instrumentation and beat. Instrumentation is measured with a spectrum-based distance measure called EMD (Earth Mover's Distance), and beat is measured with a rhythm-based measure. The similarity is obtained by weighting the values measured by these two methods.

In yet another method, features are extracted from the humming through a pitch tracker and a feature converter, and similarity is measured by matching them against the templates of a melody database.

In this method, distances are first measured with the EMD algorithm and sorted in ascending order, and a threshold is set. Only candidates whose distance is below the threshold are kept; the rest are discarded. The remaining candidates are then matched with the DTW (Dynamic Time Warping) algorithm, and the one with the smallest distance is selected from the database.

In the matching stage, existing methods do one of three things: use a simple Euclidean distance or sum of absolute differences; combine the matching scores (similarity results) of two or more measures as a weighted sum according to a specific rule; or apply matching algorithms sequentially to shrink the candidate set to be searched.

However, a simple Euclidean distance or absolute difference can produce large errors when the input humming is in a different key (chord) or contains noise. When two or more matching methods are combined by assigning each a weight (confidence) according to a specific rule, the weights are determined empirically from the known accuracy of each method rather than on a clear theoretical basis, so the combination cannot adapt to the matching environment. When different matching methods are applied hierarchically or stage by stage (sequentially) to shrink the candidate set, each stage decides on a single threshold, so the correct sound source can be wrongly excluded at an early stage even for humming that contains only a little noise.

To solve these problems, the present invention combines the matching scores computed by the DTW, EMD, and LS matching algorithms at the same stage using an SVM (score-level fusion).

A further object of the present invention is to provide more accurate retrieval results in a QbH (query-by-humming) system than are obtained with a single matching method or with score combination by a weighted sum, by combining the matching scores obtained from the DTW, EMD, and LS matching methods at the same stage.

To achieve the above object, a sound source retrieval system combining a plurality of matching algorithms according to the present invention may include: an audio signal input unit for receiving an audio signal; an audio signal analysis unit that analyzes the audio signal and extracts its features; an audio signal correction unit that corrects the audio signal in consideration of those features; and an audio signal matching search unit that applies a first matching algorithm to the features of the corrected audio signal to calculate a first output value, applies a second matching algorithm to the same features to calculate a second output value, combines the first and second output values into a similarity measure, and uses that measure to retrieve sound sources with high similarity from a sound source database.

Also in the above configuration, the audio signal correction unit may further include a zero removal unit that removes notes whose pitch is zero from the analyzed audio signal.

Also in the above configuration, the audio signal correction unit may include a filter unit that filters the analyzed audio signal in units of a predetermined filter length.

Also in the above configuration, the filter unit may include a median filter unit that performs median filtering on the analyzed audio signal in units of a predetermined filter length, and a mean filter unit that performs mean filtering on the median-filtered audio signal in units of a predetermined filter length.

Also in the above configuration, the audio signal correction unit may further include a scaling unit that scales the pitch values of the filtered audio signal using the maximum and minimum pitch values.

Also in the above configuration, the features of the corrected audio signal include the pitch values of the corrected audio signal, and the audio signal matching search unit may apply the DTW matching algorithm to those features to calculate a first output value, apply the EMD matching algorithm to them to calculate a second output value, apply the LS matching algorithm to them to calculate a third output value, and combine the first, second, and third output values to calculate the similarity measure.

Also in the above configuration, the audio signal matching search unit may receive at least two outputs calculated by applying the matching algorithms and combine them using an SVM (Support Vector Machine) to produce the similarity measure as its output.

To achieve the above object, a sound source retrieval method combining a plurality of matching algorithms according to the present invention may include: receiving an audio signal; analyzing the audio signal and extracting its features; correcting the audio signal in consideration of the extracted features; applying a first matching algorithm to the features of the corrected audio signal to calculate a first output value and applying a second matching algorithm to the same features to calculate a second output value; and retrieving from a sound source database the sound sources with high similarity to the input audio signal, using a similarity measure calculated by combining the first and second output values.

Also in the above configuration, correcting the audio signal may include removing notes whose pitch is zero from the analyzed audio signal, filtering the audio signal from which the zero-pitch notes have been removed, and scaling the pitch values of the filtered audio signal using the maximum and minimum pitch values.

Also in the above configuration, the filtering may include performing median filtering, in units of a predetermined filter length, on the audio signal from which the zero-pitch notes have been removed, and performing mean filtering, in units of a predetermined filter length, on the median-filtered audio signal.

Also in the above configuration, the output values calculated by applying the matching algorithms may be combined using an SVM (Support Vector Machine).
Preferably, each of the first matching algorithm and the second matching algorithm is one of the DTW matching algorithm, the EMD matching algorithm, and the LS matching algorithm.

According to the present invention, when a sound source is retrieved with a QbH system, combining the three matching scores calculated by the different DTW, EMD, and LS matching algorithms enables more accurate retrieval.

In addition, because the matching scores of the DTW, EMD, and LS algorithms are combined at the same stage, loss of information is prevented and a more accurate nonlinear classifier is obtained.

Furthermore, when an SVM is used, combining the three matching scores gives more accurate results than using a single matching score or combining matching scores hierarchically (sequentially).

The foregoing and additional aspects of the present invention will become more apparent from the preferred embodiments described with reference to the accompanying drawings. These embodiments are described in detail below so that those skilled in the art can readily understand and reproduce the invention.

FIG. 1 is a block diagram of a sound source retrieval system according to an embodiment of the present invention.

As shown in FIG. 1, the sound source retrieval system (100) of the present invention includes an audio signal input unit (110), an audio signal analysis unit (120), an audio signal correction unit (130), and an audio signal matching search unit (140).

The audio signal input unit (110) can receive humming, singing, or part of a sound source as input. The input audio signal may be stored in a storage unit (not shown), and the input unit may be configured to receive the audio signal directly or as a sound source file in a format the retrieval system can recognize.

The audio signal analysis unit (120) analyzes the audio signal received through the audio signal input unit (110) and extracts feature values. The extracted feature values include the pitch, instrumentation, tempo, and beat information of the input signal.

The audio signal correction unit (130) calculates the frequency band and the mean frequency of the input audio signal, as well as the mean of its pitch values. Humming generally differs from the target sound source in note, pitch, frequency, and so on, which degrades retrieval performance. Therefore, before the similarity between the input audio signal and the sound sources stored in the database is compared, the mean frequencies are calculated and the frequency information of the input signal is corrected to resemble that of the sound source (mean shifting). The audio signal correction unit (130) also performs various further corrections after mean shifting to improve retrieval performance; these are described in detail below.

The audio signal matching search unit (140) comprises a plurality of matching search units that together run at least two algorithms to generate matching scores, and it feeds the matching scores they produce into an SVM (Support Vector Machine) to calculate the final similarity measure.

The matching search units preferably generate the matching scores using the DTW (Dynamic Time Warping) matching algorithm, the EMD (Earth Mover's Distance) matching algorithm, the LS matching algorithm, and the like.
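The score-level fusion described above can be sketched as the decision function of a kernel SVM evaluated on the three matching scores. This is only an illustration under loud assumptions: the support vectors, coefficients, bias, and RBF width below are hypothetical stand-ins for a trained model, and the candidate names and scores are made up; the patent text does not specify the kernel or any trained parameters.

```python
import math

# Hypothetical "trained" SVM: (support vector, alpha_i * y_i) pairs.
# Each vector is (DTW, EMD, LS) distance, assumed normalized to [0, 1].
SUPPORT_VECTORS = [
    ((0.1, 0.1, 0.1), 1.0),    # assumed genuine match: small distances
    ((0.9, 0.9, 0.9), -1.0),   # assumed non-match: large distances
]
BIAS = 0.0    # hypothetical bias term
GAMMA = 1.0   # hypothetical RBF kernel width


def rbf_kernel(u, v, gamma=GAMMA):
    """RBF kernel exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svm_decision(scores):
    """SVM decision value; larger means 'more likely the same melody'."""
    return sum(coef * rbf_kernel(sv, scores) for sv, coef in SUPPORT_VECTORS) + BIAS


def rank_candidates(candidates):
    """Sort candidate names by fused similarity, best first."""
    return sorted(candidates, key=lambda name: svm_decision(candidates[name]), reverse=True)


# Made-up (DTW, EMD, LS) scores for three database entries.
candidates = {
    "song_a": (0.2, 0.1, 0.15),
    "song_b": (0.8, 0.7, 0.9),
    "song_c": (0.5, 0.4, 0.6),
}
ranking = rank_candidates(candidates)
```

All three scores enter the classifier together, so no candidate is discarded by an early single-threshold stage, which is the point the description makes about same-stage fusion.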

FIG. 2 is a block diagram of an audio signal correction unit according to an embodiment of the present invention.

As shown in FIG. 2, the audio signal correction unit (200) includes a zero removal unit (210), a mean correction unit (220), a median filter unit (230), a mean filter unit (240), and a scaling unit (250).

The zero removal unit (210) removes from the input audio signal the signals whose pitch value is 0, that is, the signals corresponding to silence.
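As a minimal sketch of zero removal, assuming the pitch contour is available as a plain list with one pitch value per frame (the function name and sample values are hypothetical):

```python
def remove_zero_pitch(pitches):
    """Drop frames whose pitch is 0, which are treated as silence."""
    return [p for p in pitches if p != 0]


# Made-up contour: zeros mark silent frames between hummed notes.
contour = [0, 0, 60, 62, 0, 64, 65, 0]
voiced = remove_zero_pitch(contour)
```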

Because the input audio signal may be higher or lower in pitch than the comparison-target sound source, the mean correction unit (220) corrects it so that the two become reasonably similar.

The median filter unit (230) performs median filtering to remove noise contained in the input audio signal. Median filtering is used to remove randomly occurring peak noise.

The mean filter unit (240) performs mean filtering on the median-filtered audio signal to correct pitch fluctuations and smooth the signal.

To address the difference in pitch range between the input audio signal and the comparison-target sound source file, the scaling unit (250) performs min-max scaling so that the ranges of the pitch values coincide. To this end, it calculates the minimum and maximum pitch of the input audio signal and of the comparison-target sound source signal, and normalizes them so that the range between minimum and maximum is the same for both.

The functions performed by the audio signal correction unit (200) are described in detail below.

FIG. 3 shows the frequency variation of the input audio signal and the comparison-target sound source. As FIG. 3 shows, the mean frequency of the input audio signal does not match that of the sound source file data. The mean correction unit (220) therefore corrects the mean frequency of the audio signal to match that of the comparison-target sound source. More preferably, the mean frequencies of both the input audio signal and the comparison-target sound source may be corrected to 0.
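The preferred variant, shifting both sequences to a mean of 0, can be sketched as follows. The sample values are made up, and the sketch assumes the correction is applied to a list of pitch values rather than to raw frequencies.

```python
def mean_shift(pitches):
    """Subtract the mean so the sequence is centered on 0."""
    mean = sum(pitches) / len(pitches)
    return [p - mean for p in pitches]


reference = [63, 65, 69, 67]   # hypothetical stored melody
hummed = [60, 62, 66, 64]      # same contour, hummed three units lower

shifted_reference = mean_shift(reference)
shifted_hummed = mean_shift(hummed)
```

After the shift both sequences become `[-3.0, -1.0, 3.0, 1.0]`, so a constant offset in pitch no longer contributes to the distance between them.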

FIG. 4 shows the pitch information extracted from the comparison-target sound source and from the input audio signal.

As shown in FIG. 4(b), quite a few peak points occur in the pitch contour of the input audio signal. The main causes are voice tremor in audio input such as humming, and noise from the surroundings or the input terminal. Because such noise degrades matching performance, the audio signal correction unit (200) corrects the input signal through various filtering steps. FIG. 4(b) also shows that notes with a pitch of 0 have been removed from the input audio signal. In addition, the input audio signal is lower than the average pitch of the comparison-target sound source file; as explained above, this can be corrected by the mean correction unit.

FIG. 5 illustrates the median filtering performed by the median filter unit.

The median filter unit has a filter of a predetermined size. The filter size is generally preferably an odd number; FIG. 5 shows a median filter with a filter size of 5. For convenience, the description below assumes a filter size of 5.

The median filter unit sorts the pitches of the first through fifth notes of the input audio signal in ascending order. The value in the middle (third) position of the sorted list (4 in FIG. 5) then replaces the pitch of the third note. That is, the pitch of the third note is corrected to 4, the median of the five notes, which corrects peak values caused by noise and the like. Next, the pitches of the second through sixth notes are sorted in ascending order and the median replaces the pitch of the fourth note. Then the third through seventh notes are filtered and the median replaces the pitch of the fifth note. This process is repeated in sequence until the last five notes have been sorted and filtered. When median filtering is complete, every pitch value has been corrected except those of the first two and the last two notes.
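The sliding median described above can be sketched as follows. One assumption is made explicit: each window is taken from the original, unfiltered values, since the text does not say whether already-corrected values feed into later windows. The sample contour is made up and is not the data of FIG. 5.

```python
def median_filter(pitches, size=5):
    """Replace each interior pitch by the median of its size-wide window.

    The first and last size // 2 values are left unchanged, as in the text.
    """
    if size % 2 == 0:
        raise ValueError("filter size should be odd")
    half = size // 2
    out = list(pitches)
    for i in range(half, len(pitches) - half):
        window = sorted(pitches[i - half:i + half + 1])  # ascending order
        out[i] = window[half]                            # middle value
    return out


# Made-up contour with two spikes (9 and 1) caused by noise.
contour = [4, 4, 9, 4, 5, 5, 5, 1, 6, 6]
filtered = median_filter(contour)
```

With this input the spikes disappear and `filtered` is `[4, 4, 4, 5, 5, 5, 5, 5, 6, 6]`; the first two and last two values are untouched.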

FIG. 6 illustrates the mean filtering performed by the mean filter unit.

Mean filtering is broadly similar to median filtering, except that the replacement value is not the median but the mean of all pitch values within the filter length. As shown in FIG. 6, with a filter size of 5 the mean of the five pitch values, 8.6, is calculated, and 8, obtained by dropping the fractional part, may replace the pitch of the third note. When the mean is not an integer, the design may instead round it or use the non-integer value as is.
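A sketch of the mean filter with the truncation option follows. The five sample values are hypothetical, chosen only so that their mean is 8.6 as in the text; they are not the values of FIG. 6. As with the median sketch, windows are assumed to be taken from the unfiltered input.

```python
import math


def mean_filter(pitches, size=5, round_down=True):
    """Replace each interior pitch by the mean of its size-wide window.

    With round_down=True the fractional part is dropped (8.6 becomes 8);
    otherwise the non-integer mean is kept as is.
    """
    half = size // 2
    out = list(pitches)
    for i in range(half, len(pitches) - half):
        window = pitches[i - half:i + half + 1]
        mean = sum(window) / size
        out[i] = math.floor(mean) if round_down else mean
    return out


window_example = [7, 8, 9, 9, 10]        # sum 43, mean 8.6
smoothed = mean_filter(window_example)   # third value becomes 8
```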

It was confirmed that when mean filtering is performed first and median filtering second, the median filtering does not work properly. It is therefore preferable to perform median filtering first to attenuate the peak values and then perform mean filtering.

FIG. 7 illustrates the min-max scaling performed by the scaling unit.

As explained above, the input audio signal and the comparison-target sound source may differ in pitch height. In that case the difference between the minimum and maximum values may not match either. That is, when the input audio signal is humming whose pitch values are lower than those of the comparison-target sound source, the difference between its minimum and maximum is smaller than the difference between the maximum and minimum pitch of the comparison-target sound source. For example, if the pitch values of the comparison-target sound source are 63, 65, 69 while those of the humming input are 60, 62, 64, the maximum pitch value of both signals can be scaled to 5 and the minimum to -5, with the remaining pitch values scaled proportionally to values between -5 and 5.
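The example above can be worked through with a small min-max scaling sketch. The function name and the handling of a constant sequence are assumptions; the target range of -5 to 5 and the sample pitch values come from the text.

```python
def min_max_scale(pitches, low=-5.0, high=5.0):
    """Linearly map pitches so that min -> low and max -> high."""
    lo, hi = min(pitches), max(pitches)
    if hi == lo:
        # Assumed fallback: a constant sequence maps to the middle of the range.
        return [(low + high) / 2 for _ in pitches]
    return [low + (p - lo) * (high - low) / (hi - lo) for p in pitches]


scaled_reference = min_max_scale([63, 65, 69])   # comparison-target sound source
scaled_hummed = min_max_scale([60, 62, 64])      # humming input
```

The humming input becomes `[-5.0, 0.0, 5.0]`, and the reference keeps -5 and 5 at its extremes with the middle note at about -1.67, so both sequences now span the same range.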

Next, the function of the audio signal matching search unit is described in detail.

FIG. 8 is a block diagram of an audio signal matching search unit according to an embodiment of the present invention.

The audio signal matching search unit (800) runs at least two different matching algorithms. Preferably, it produces the results of a DTW matching search (810), an EMD matching search (820), and an LS matching search (830), each using the pitch information of the input audio signal. The applied matching algorithms may also include an RA (Recursive Alignment) matching algorithm.

Because the DTW matching algorithm uses DP matching, the DP (dynamic programming) matching algorithm is described first. DP matching is an algorithm that computes the similarity of two sound sources while accounting for the length difference that generally arises between the input music information (R = [r_0, r_1, ..., r_{NR-1}]) and a stored sound source (Q = [q_0, q_1, ..., q_{NQ-1}]), even when they are the same piece of music.

FIG. 9 illustrates the dynamic programming matching algorithm.

Dynamic programming matching (DP matching) is a method for measuring the similarity between two patterns of different lengths, as in FIG. 9; it can measure the pattern similarity along each path using operations such as insertion and deletion. As shown in FIG. 9, matching with this method is subject to five constraints. First, the endpoint constraint: the start and end points of the input pattern and the reference pattern are aligned before comparison, so accurate detection of the speech interval is required. Second, the monotonicity constraint: the optimal path must always increase monotonically. Third, the local continuity constraint: the paths by which a node on the grid can be reached are restricted, to prevent excessive contraction or expansion in time. Fourth, the global path constraint: the allowable region over the whole span between input and reference patterns of different lengths is limited. Fifth, the slope weighting condition: when the cost of a local path is calculated, the steps are weighted differently according to their slope rather than equally, to prevent unreasonable warping over time.

In general, input music information such as humming may correspond to an arbitrary portion of the stored sound source data, so the comparison is performed while the input music information (R) is slid along the stored data, as shown in FIG. 10.

The following equation is an example of a formula for calculating the dissimilarity between the two pieces of information.

[Equation 1]

In Equation 1, r_i(m) is the comparison target sound source information stored in the sound source database, and q_j(m-ps) is the input audio signal information. Here, q_j(m) is slid by ps while the dissimilarity between the two pieces of information (r_i(m) and q_j(m)) is measured according to Equation 1. Since the details of the DP matching algorithm are well known, a detailed description is omitted.

As mentioned above, matching methods based on DP are widely used because they take into account the length difference between the input humming and the stored sound source. To use DTW, the five constraints described above are essential, and the first of them in particular must be satisfied.

The first constraint of DTW is that the start points and end points of the two waveforms to be compared must coincide.

FIG. 11 illustrates an example of the first constraint. Taking (a) as the input signal, (b) is (a) stretched to twice its length, and (c) is (a) followed by an additional, different signal. Assume that (a) is the input data and that (b) and (c) are stored data. Visually, because (c) contains exactly the same signal as (a), one might expect the match between (a) and (c) to be better than the match between (a) and (b). However, when (a) is matched against (c), the first constraint of DTW is violated, so a good dissimilarity value cannot be obtained. The advantage of DTW, namely that insertions and deletions occur freely during matching, holds only when the start and end points coincide, and this can be confirmed by matching against (b). When (a) and (b) are matched, the start and end points coincide, and of the insertion and deletion operations that characterize DTW, insertion yields a dissimilarity close to zero. Therefore, when an input audio signal such as humming data is received, a segment detection process that determines exactly where the start and end points lie relative to the sound source is indispensable.
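By way of illustration only, the effect of the endpoint constraint described for FIG. 11 can be sketched as follows. This is a minimal DTW without slope weights or a global path constraint; the function name `dtw_distance`, the absolute-difference local cost, and the sample values for (a), (b) and (c) are assumptions made for the example and are not taken from the specification.

```python
def dtw_distance(query, reference):
    """Plain DTW: the path is forced to run from the first to the last sample of both sequences."""
    n, m = len(query), len(reference)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning query[:i] with reference[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(query[i - 1] - reference[j - 1])
            # Monotonic steps only: repeat a query sample, repeat a reference sample, or advance both
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


a = [1.0, 3.0, 2.0, 4.0]                        # input signal (a), hypothetical values
b = [1.0, 1.0, 3.0, 3.0, 2.0, 2.0, 4.0, 4.0]    # (a) stretched to twice its length
c = a + [7.0, 6.0, 8.0, 5.0]                    # (a) followed by an unrelated signal

d_ab = dtw_distance(a, b)   # endpoints coincide, insertions absorb the stretch
d_ac = dtw_distance(a, c)   # endpoints do not coincide, the tail of (c) must still be matched
print(d_ab, d_ac)
```

With these values the stretched copy matches at zero cost, while the sequence that merely contains (a) is penalized for its trailing samples, which is why segment detection or a sliding comparison is needed.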

The fourth of the five DTW constraints is the global path constraint. When the input data is matched against the sound source data, this constraint limits the allowable region over the entire span to some extent. When a person hums or sings a tune, the tempo usually does not differ greatly from that of the sound source. In other words, because the humming waveform is rarely compressed or stretched severely relative to the sound source, there is no need to allow the search to cover every region. For example, if the input data has length 5 and the sound source data has length 5, as in FIG. 12, there are 25 search cells in total. Since a person does not hum in the extreme manner of five times faster or slower than the sound source, not every cell needs to be computed. Therefore, if the path search is restricted to some extent, as indicated by the red line in FIG. 12, and matching is allowed only within that region, unnecessary computation is eliminated and processing becomes faster, and matching accuracy can also be improved.
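As a non-limiting sketch, a global path constraint can be realized as a fixed-width band around the diagonal of the cost grid (commonly called a Sakoe-Chiba band). The specification only shows the restricted region in FIG. 12 and does not name a particular shape, so the band shape, the `band` parameter, and the function name `banded_dtw` are assumptions for illustration.

```python
def banded_dtw(query, reference, band):
    """DTW restricted to cells within `band` of the (rescaled) diagonal.

    Returns the accumulated cost and the number of grid cells actually evaluated.
    The cost is infinite if the band is too narrow to connect the two endpoints.
    """
    n, m = len(query), len(reference)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    evaluated = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Skip cells outside the allowed region around the diagonal
            if abs(i * m / n - j) > band:
                continue
            evaluated += 1
            local = abs(query[i - 1] - reference[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m], evaluated


pitch = [1.0, 2.0, 3.0, 4.0, 5.0]            # length-5 example, as in the 5 x 5 grid of FIG. 12
narrow = banded_dtw(pitch, pitch, band=1)    # only cells next to the diagonal
full = banded_dtw(pitch, pitch, band=4)      # band wide enough to cover the whole grid
print(narrow, full)
```

For the 5 x 5 grid, a band of width 1 evaluates 13 of the 25 cells while still finding the zero-cost diagonal path.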

In the present invention, because information such as humming may correspond to an arbitrary portion of the stored sound source data, the comparison is performed while the input data is moved along the sound source data in the manner shown in FIG. 13. In this case, the size of the region to be matched in the sound source data is preferably selected based on the length of the input data.
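The sliding comparison described above might be sketched as follows, under the assumption that the matching window is exactly as long as the input data and advances by a fixed hop. The names `best_sliding_match` and `hop`, and the pitch values, are hypothetical; the compact DTW is included only to keep the example self-contained.

```python
def dtw_distance(query, reference):
    """Plain DTW with coinciding endpoints and absolute-difference local cost."""
    n, m = len(query), len(reference)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(query[i - 1] - reference[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def best_sliding_match(query, song, hop=1):
    """Slide a query-length window over the song and keep the lowest dissimilarity."""
    window = len(query)  # window size chosen from the length of the input data
    best_distance, best_offset = float("inf"), -1
    for start in range(0, len(song) - window + 1, hop):
        d = dtw_distance(query, song[start:start + window])
        if d < best_distance:
            best_distance, best_offset = d, start
    return best_distance, best_offset


song = [5.0, 5.0, 7.0, 9.0, 7.0, 5.0, 4.0, 2.0, 4.0, 5.0]   # stored pitch sequence (hypothetical)
query = [9.0, 7.0, 5.0, 4.0]                                 # hummed excerpt from the middle
distance, offset = best_sliding_match(query, song)
print(distance, offset)
```

A larger `hop` trades accuracy for speed; if the song is shorter than the query, the function returns an infinite distance and an offset of -1.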

Next, the EMD algorithm will be described. The EMD algorithm is a specialized solution to the transportation problem, a class of linear programming problem, and is used effectively in many fields. FIG. 14 shows an example of a transportation problem; the EMD algorithm finds the optimal solution by calculating the transport distances and supply amounts. It can be applied to the data as a whole, and matching of partial regions is also possible. In addition, it is a true metric, requires little computation, and is easy to implement, so it is commonly used to discriminate and match multimodal distributions.

The EMD algorithm involves suppliers and consumers. For example, assuming that the supplier carries exactly the minimum amount of soil that the consumer needs to fill a hole, the EMD algorithm ultimately measures the amount of work required to carry that soil. When the EMD algorithm is applied to two finite discrete distributions of points, the two are denoted x and y in the equation below.

[Equation 2]

In the above equation, w and u denote the weights assigned to the values of x and y. Considering the total weight of x and the total weight of y, in a case such as Figure 38 both totals are 1. In addition, x has two values and y has three values, the area of the circle drawn around each point is proportional to its weight, and the EMD result between two sets of values with equal weight is proportional to the amount of work performed.

The amount transported from x_i to y_j is denoted f_ij and is called the flow. d_ij denotes the distance between x_i and y_j. Using these, the total transport amount when transport is carried out from each location can be obtained from the following equation.

[Equation 3]

Substituting the example of FIG. 15 into the above equation gives a total transport amount of 0.23 * 155.7 + 0.51 * 252.3 + 0.26 * 316.30 = 246.7. However, Figure 38 is not the optimized result of the EMD algorithm, and in actual use of the EMD algorithm x and y rarely have equal weights.

FIG. 16 shows a case in which x and y have different total weights. Most cases in which the EMD algorithm is actually used are of this kind. The total transport amount in this case is 0.23 * 155.7 + 0.25 * 252.3 + 0.26 * 198.2 = 150.4. The EMD algorithm divides the minimum amount of work by the amount moved from the heavier side to the lighter side. That is, the optimized total transport amount is normalized by dividing it by the total supply amount. This is expressed by the following equation.

[Equation 4]
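For reference, the formulation of the Earth Mover's Distance commonly given in the literature is consistent with the quantities f_ij, d_ij, w and u described above and may be written as below. The equation images are not reproduced in this text, so this is offered as a sketch of the standard form rather than as the exact equations of the specification; the index limits m and n (the numbers of points in x and y) are notation introduced here.

```latex
\mathrm{WORK}(F) = \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}\, d_{ij},
\qquad
\mathrm{EMD}(x, y) = \frac{\min_{F} \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}\, d_{ij}}
                          {\min\!\left( \sum_{i=1}^{m} w_i,\; \sum_{j=1}^{n} u_j \right)}

\text{subject to}\quad
f_{ij} \ge 0, \qquad
\sum_{j=1}^{n} f_{ij} \le w_i, \qquad
\sum_{i=1}^{m} f_{ij} \le u_j, \qquad
\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\!\left( \sum_{i=1}^{m} w_i,\; \sum_{j=1}^{n} u_j \right)
```

Under these constraints the denominator equals the total flow, so the normalization amounts to dividing the minimum work by the smaller of the two total weights.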

As a result, for FIG. 16, using the smallest supply amount, 0.74, the result of the EMD algorithm is EMD(x, y) = 150.4 / 0.74 = 203.3.
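The arithmetic for FIG. 16 can be checked with the short sketch below. It takes the three flows and distances quoted in the text as given and does not solve the underlying transportation problem; the function name `total_work` is an assumption for the example.

```python
def total_work(flows, distances):
    """Sum of flow times distance over the listed transport routes."""
    return sum(f * d for f, d in zip(flows, distances))


flows = [0.23, 0.25, 0.26]          # f_ij values quoted for FIG. 16
distances = [155.7, 252.3, 198.2]   # corresponding d_ij values

work = total_work(flows, distances)
supply = sum(flows)                 # total amount moved, 0.74
emd = work / supply                 # normalized result

print(round(work, 1), round(supply, 2), round(emd, 1))
```

Carrying the unrounded work value of about 150.418 through the division is what yields 203.3 after rounding.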

Applying the EMD algorithm, with the calculation method described above, to a QbSH system offers several advantages. First, the EMD algorithm uses a top-down approach. String matching, which uses a bottom-up approach, consumes time and cost in performing insertions and deletions, and errors arise at the beginning and end of the matched segment and wherever timing is delayed; the EMD algorithm is robust to these problems. Second, it requires less computation than sliding matching or DTW, so processing is fast. For example, when input data must be matched against a vast amount of sound source data, a complex and time-consuming algorithm takes too long to be used in real time, whereas the EMD algorithm can be used in real-time programs. Third, data can be grouped into units or divided into parts for processing. Because this determines how finely the data is analyzed, the EMD algorithm can be applied either globally or locally. Furthermore, most note-based methods built on a bottom-up approach cannot preserve both pitch and tempo information and incur some loss, whereas the EMD algorithm applied globally preserves tempo and duration information over long time spans, making it suitable for observing and measuring pitch and tempo information globally.

Next, the LS algorithm will be described.

The LS (linear scaling) algorithm linearly compresses or stretches the pitch vector of the input data by several scale factors and compares each result with the sound sources stored in the database; it is known as the simplest method for melody recognition. When matching is attempted with the LS algorithm, the length of the input data is varied from 0.5 to 2.0 times its original length. FIG. 17 shows the input data being compared after it has been scaled by factors of 0.5, 0.75, 1, 1.25 and 1.5 using the LS algorithm.

As shown in FIG. 17, the input data most closely resembles the sound source when it is stretched by a factor of 1.25, and the Euclidean distance is smallest at that factor.
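A minimal sketch of linear scaling is given below for illustration. The linear-interpolation resampler, the comparison against the leading samples of the reference, and the division of the squared error by the scaled length (so that different scale factors are comparable) are all assumptions; the specification states only that the Euclidean distance is compared across scale factors. The names `resample` and `linear_scaling` are hypothetical.

```python
import math


def resample(pitch, new_len):
    """Linearly interpolate a pitch vector to new_len samples."""
    if new_len == 1:
        return [pitch[0]]
    out = []
    for k in range(new_len):
        pos = k * (len(pitch) - 1) / (new_len - 1)   # position in the original vector
        lo = int(pos)
        hi = min(lo + 1, len(pitch) - 1)
        frac = pos - lo
        out.append(pitch[lo] * (1 - frac) + pitch[hi] * frac)
    return out


def linear_scaling(query, reference, factors=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Try each scale factor and return (distance, factor) for the closest match."""
    best = (float("inf"), None)
    for s in factors:
        n = round(len(query) * s)
        if n < 1 or n > len(reference):
            continue  # scaled query would be empty or longer than the reference
        scaled = resample(query, n)
        # Length-normalized Euclidean distance (normalization is an assumption)
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(scaled, reference)) / n)
        if d < best[0]:
            best = (d, s)
    return best


reference = [float(v) for v in range(10)]   # stored pitch contour (hypothetical ramp)
query = resample(reference, 8)              # same contour hummed faster, 8 samples
distance, factor = linear_scaling(query, reference)
print(distance, factor)
```

In this toy case the query is the reference compressed to 8 samples, so stretching it by 1.25 restores the original 10-sample contour and gives the smallest distance.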

Next, the function of calculating the final similarity measurement value by applying an SVM to the plurality of matching scores obtained from the different matching algorithms will be described.

FIG. 18 illustrates an SVM classifier in three-dimensional space.

As shown in FIG. 18, when the matching scores obtained by applying the DTW, EMD and LS matching algorithms are used as input, each set of scores is represented as a vector in three-dimensional space. In FIG. 18, the blue vectors are formed from the matching values extracted when the same tune is compared, and the red vectors are formed from the matching values obtained when different tunes are compared. The portion shown in yellow is the classifier obtained through training.

The SVM undergoes a training process in order to obtain an accurate classifier. A matching value vector is formed from the matching values obtained by applying the different matching algorithms described above to the input audio signal. The distribution of matching value vectors from same-tune comparisons and the distribution of matching value vectors from different-tune comparisons are then obtained. In a same-tune comparison the dissimilarity to the original sound source is small, so the measured distance values are small and a vector distribution close to the origin in FIG. 18 is obtained; the matching value vectors from different-tune comparisons have large dissimilarity and therefore lie relatively far from the origin.

The matching value vectors obtained from same-tune and different-tune comparisons, as shown in FIG. 18, are passed through the SVM training process to yield an optimal classifier that separates the two distributions, as shown in FIG. 19. Specifically, in the training process the output value for matching values from same-tune comparisons is set to -1 and the output value for matching values from different-tune comparisons is set to 1, so that an optimal classifier can be obtained. In the ideal case, the SVM output values produced from these matching values form two distributions as in FIG. 19, and same-tune and different-tune comparisons are distinguished using a threshold that separates the two distributions with minimum error. That is, the classifier can be obtained by setting the threshold at the point where, for the two distributions of similarity measurement values obtained in training, the same-tune comparison error and the different-tune comparison error are simultaneously minimized.
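The threshold selection described above might be sketched as follows. The criterion used here, minimizing the sum of the two error rates over the observed output values, is one possible reading of "simultaneously minimized" and is an assumption, as are the function name `choose_threshold` and the sample output values.

```python
def choose_threshold(same_outputs, diff_outputs):
    """Pick the threshold that minimizes the sum of the two error rates.

    same_outputs: SVM outputs for same-tune comparisons (trained toward -1)
    diff_outputs: SVM outputs for different-tune comparisons (trained toward +1)
    """
    candidates = sorted(set(same_outputs) | set(diff_outputs))
    best_threshold, best_error = None, float("inf")
    for t in candidates:
        # Same-tune pair wrongly rejected: output above the threshold
        same_error = sum(s > t for s in same_outputs) / len(same_outputs)
        # Different-tune pair wrongly accepted: output at or below the threshold
        diff_error = sum(d <= t for d in diff_outputs) / len(diff_outputs)
        error = same_error + diff_error
        if error < best_error:
            best_threshold, best_error = t, error
    return best_threshold, best_error


same = [-1.4, -1.1, -0.9, -0.6, 0.2]   # hypothetical training outputs, same tune
diff = [-0.3, 0.7, 0.9, 1.2, 1.5]      # hypothetical training outputs, different tune
threshold, error = choose_threshold(same, diff)
print(threshold, error)
```

When several thresholds give the same total error, this sketch keeps the smallest one; an equal-error-rate criterion would be an alternative design choice.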

Once this threshold has been determined from the training data, a subsequent comparison on test data is judged to be a same-tune comparison (that is, the input audio signal and the sound source currently being matched are judged to be the same tune) if the SVM output value is at or below the threshold, and a different-tune comparison (that is, the input audio signal and the sound source currently being matched are judged to be different tunes) if the SVM output value is at or above the threshold. In other words, the smaller the similarity measurement value, the higher the similarity of the sound source.

When query data from a user's humming is input, it is matched against the sound sources in the database by the three matching methods above (DTW, EMD, LS). Three matching values are thereby obtained, and inputting them to the previously trained SVM yields a single output value. Whether or not the sound source compared with the humming is the same song is then determined according to the predetermined threshold in a distribution such as that of Figure 2.

That is, the similarity measurement value is obtained by the SVM, the matching rank over all sound sources is determined in ascending order of similarity measurement value, and the matching sound source can be retrieved by extracting sound sources starting from the one with the smallest similarity measurement value.
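The ranking step can be illustrated with the sketch below. A fixed linear function stands in for the trained SVM's decision function; its weights and bias, the function names, the song titles and the matching scores are all hypothetical and serve only to show the ordering by ascending similarity measurement value.

```python
def similarity_measure(scores, weights=(0.5, 0.3, 0.2), bias=-1.0):
    """Stand-in for the trained SVM output (hypothetical linear decision function).

    scores: (DTW, EMD, LS) matching scores for one candidate sound source.
    """
    return sum(w * s for w, s in zip(weights, scores)) + bias


def rank_sound_sources(match_scores):
    """Order candidates so that the smallest similarity measurement value comes first."""
    ranked = sorted(match_scores.items(), key=lambda item: similarity_measure(item[1]))
    return [title for title, _ in ranked]


match_scores = {
    "song_a": (3.1, 2.4, 2.9),   # large dissimilarity on all three matchers
    "song_b": (0.4, 0.6, 0.3),   # small dissimilarity, likely the hummed tune
    "song_c": (1.8, 1.1, 2.0),
}
ranking = rank_sound_sources(match_scores)
print(ranking)
```

In practice the stand-in function would be replaced by the output of the classifier obtained in training, and the top-ranked entries would be returned as the search result.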

The present invention has been described above with reference to its preferred embodiments.

Those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a restrictive sense. The scope of the present invention is set forth in the claims rather than in the foregoing description, and all differences within the scope equivalent thereto should be construed as being included in the present invention.

FIG. 3 illustrates the frequency variation of an input audio signal and of a comparison target sound source.

FIGS. 9 and 10 are diagrams for explaining the DP matching algorithm.

FIGS. 11 to 13 are diagrams for explaining the DTW matching algorithm.

FIG. 14 illustrates an example of a transportation problem for explaining the EMD matching algorithm.

FIG. 15 illustrates an example of the EMD algorithm for an equal-weight distribution.

FIG. 16 illustrates an example of the EMD algorithm for an unequal-weight distribution.

FIG. 17 illustrates an example of the use of the LS algorithm.

FIGS. 18 and 19 are diagrams for explaining the SVM according to the present invention.

Claims

An audio signal input unit for receiving an audio signal;

An audio signal analyzer configured to analyze the audio signal and extract feature values of the audio signal;

An audio signal correction unit configured to correct an audio signal by inputting a feature value of the audio signal;

And an audio signal matching search unit configured to calculate a first output value by applying a first matching algorithm with the feature value of the corrected audio signal as input, to calculate a second output value by applying a second matching algorithm, different from the first matching algorithm, with the feature value of the corrected audio signal as input, to calculate a similarity measurement value with the first output value and the second output value as input, and to search for and extract a sound source having high similarity from a sound source database using the calculated similarity measurement value, the foregoing units forming a sound source search system combining a plurality of matching algorithms.

The system of claim 1, wherein the audio signal correction unit comprises

A zero removal unit for performing a correction that removes sounds having a pitch of zero from the analyzed audio signal.

The system of claim 1, wherein the audio signal correction unit comprises

A filter unit for filtering the analyzed audio signal in units of a predetermined filter length.

The system of claim 3, wherein the filter unit comprises

A median filter unit for performing median filtering on the analyzed audio signal in units of a predetermined filter length; and

An average filter unit for performing average filtering, in units of a predetermined filter length, on the audio signal to which the median filtering has been applied.

The system of claim 2, wherein the audio signal correction unit comprises

A scaling unit for scaling the pitch values of the filtered audio signal using the maximum and minimum values of the pitch.

The system of claim 1, wherein

The feature value of the corrected audio signal includes pitch values of the corrected audio signal, and

The audio signal matching search unit

Calculates the first output value by applying a DTW matching algorithm with the feature value of the corrected audio signal as input, calculates the second output value by applying an EMD matching algorithm with the feature value of the corrected audio signal as input, calculates a third output value by applying an LS matching algorithm with the feature value of the corrected audio signal as input, and calculates the similarity measurement value with the first output value, the second output value, and the third output value as input.

The system of claim 1, wherein the audio signal matching search unit

Receives at least two output values calculated by applying the matching algorithms, combines them using a support vector machine (SVM), and outputs the combined result.

The system of claim 7, wherein the SVM

Comprises a classifier obtained based on data acquired through a training process, the classifier determining whether the compared sounds are the same according to the similarity measurement value.

The system of claim 8, wherein the classifier

Is obtained from the region in which the same-sound comparison error and the different-sound comparison error are minimized in the distribution of the similarity measurement values obtained through the training process.

Receiving an audio signal;

Analyzing the audio signal to extract feature values of the audio signal;

Correcting the audio signal by inputting a feature value extracted from the analyzed audio signal;

Calculating a first output value by applying a first matching algorithm with the feature value of the corrected audio signal as input, and calculating a second output value by applying a second matching algorithm with the feature value of the corrected audio signal as input; and

Searching for and extracting a sound source having the highest similarity from a sound source database using a similarity measurement value calculated by combining the first output value and the second output value.

The method of claim 10, wherein correcting the audio signal comprises:

Removing sounds having a pitch of zero from the analyzed audio signal;

Filtering the audio signal from which the zero-pitch sounds have been removed; and

Scaling the pitch values of the filtered audio signal using the maximum and minimum values of the pitch.

The method of claim 11, wherein the filtering comprises:

Performing median filtering, in units of a predetermined filter length, on the audio signal from which the zero-pitch sounds have been removed; and

Performing average filtering, in units of a predetermined filter length, on the audio signal to which the median filtering has been applied.

The method of claim 10, wherein

The output values calculated by applying the matching algorithms are combined using a support vector machine (SVM).

A computer-readable medium having recorded thereon a program for executing the method of claim 10.

The system according to any one of claims 1 to 9, wherein

Each of the first matching algorithm and the second matching algorithm is any one of a DTW matching algorithm, an EMD matching algorithm, and an LS matching algorithm, and the first matching algorithm and the second matching algorithm are different from each other.