KR101041037B1

KR101041037B1 - Method and Apparatus for speech and music discrimination

Info

Publication number: KR101041037B1
Application number: KR1020090017109A
Authority: KR
Inventors: 육동석; 양경철
Original assignee: 고려대학교 산학협력단
Priority date: 2009-02-27
Filing date: 2009-02-27
Publication date: 2011-06-14
Also published as: KR20100098100A

Abstract

The present invention relates to a method for distinguishing speech from music. A method according to one embodiment of the present invention includes: calculating, for each predetermined interval of an input signal, the average of the peak frequency change; and classifying the signal as speech when the peak frequency change is greater than or equal to a threshold, and as music when the peak frequency change is less than the threshold. According to the present invention, high discrimination performance can be achieved while maintaining a fast response.

Description

Method and Apparatus for Distinguishing Speech and Music {Method and Apparatus for speech and music discrimination}

The present invention relates to preprocessing for a speech recognition system, and more particularly to a method and apparatus for distinguishing speech from music.

With the increasing performance of personal computers, the spread of mass storage devices, and the growth of computer networks typified by the World Wide Web (WWW), it has become very easy to create, transmit, and process digitally represented multimedia information. In a rapidly advancing information environment where multimedia information grows at a tremendous rate, existing search methods are not effective for finding the content a user needs. A method that lets users search for the desired information based on its content is therefore required.

Recently, research on distinguishing speech from music in the audio signals of multimedia data has been ongoing in a variety of applications. In particular, as the application areas of speech recognition systems widen, preprocessing methods for obtaining good performance in real-life environments are receiving much attention. As preprocessing applications for speech recognition are further subdivided, research is being conducted on separating speech from music in environments that contain music, such as broadcasting.

Among conventional SMD (Speech and Music Discrimination) methods, approaches have been proposed that distinguish speech from music using time-varying rhythm, which can be regarded as a principal characteristic of music. These methods generally rely on the principle that music changes more slowly than speech and at relatively regular intervals, so their performance inevitably varies greatly when the tempo becomes faster or the instruments change with the type of music.

In the 2007 paper "Speech/music discrimination for robust speech recognition in robots" (M. Y. Choi, H. J. Song, and H. S. Kim, IEEE International Symposium on Robot and Human Interactive Communication, pp. 118-121), a signal is classified as speech if the Mean of Minimum Cepstral Distance (MMCD) between frames is small and as music if it is large. Spectral flux has also been used, distinguishing speech from music by the spectral energy difference between frames. These methods perform relatively well but have the drawback that they cannot achieve a fast response.

Accordingly, a first technical object of the present invention is to provide a method for distinguishing speech from music that achieves high performance while maintaining a fast response.

A second technical object of the present invention is to provide an apparatus for distinguishing speech from music that achieves high performance while maintaining a fast response.

To achieve the first technical object, a method for distinguishing speech from music according to one embodiment of the present invention includes: calculating, for each predetermined interval of an input signal, the average of the peak frequency change; and classifying the signal as speech when the peak frequency change is greater than or equal to a threshold, and as music when the peak frequency change is less than the threshold.

The method according to one embodiment of the present invention may further include extracting, from the signal, the signal classified as speech according to the classification result.

Preferably, the peak frequency change may be calculated using an instantaneous frequency change d(t, b) computed from the peak frequency f(t, b), where d(t, b) is the frequency change of band b at time t and f(t, b) is the peak frequency of band b at time t. Here, with f_max denoting the change limit width, the instantaneous frequency change may be set to 0 when the change is f_max or more.

To achieve the first technical object, a method for distinguishing speech from music according to another embodiment of the present invention includes: dividing a signal in which speech and music are mixed into a plurality of bands; calculating, for each band, the average of the peak frequency change over a predetermined interval; and classifying the signal of a band whose peak frequency change is greater than or equal to a threshold as speech, and the signal of a band whose peak frequency change is less than the threshold as music.

Preferably, the plurality of bands may be frequency bands divided at 125 Hz intervals along the frequency axis.

Preferably, the plurality of bands may be frequency bands obtained by dividing the range from 270 Hz to 3010 Hz at regular intervals.

Preferably, the peak frequency change may be calculated using an instantaneous frequency change d(t, b) computed from the peak frequency f(t, b), where d(t, b) is the frequency change of band b at time t and f(t, b) is the peak frequency of band b at time t.

Here, with f_max denoting the change limit width, the instantaneous frequency change may be set to 0 when the change is f_max or more.

Preferably, in calculating the average of the peak frequency change, an average energy may be calculated for each band, and the instantaneous frequency change of any band whose average energy is at or below a predetermined ratio of the overall average energy of the signal may be set to 0.

To achieve the second technical object, an apparatus for distinguishing speech from music according to one embodiment of the present invention includes: a harmonics change tracking unit that divides a signal in which speech and music are mixed into a plurality of bands and calculates, for each band, the average of the peak frequency change over a predetermined interval; and a speech/music discrimination unit that, for each band, classifies the signal of a band whose peak frequency change is greater than or equal to a threshold as speech and the signal of a band whose peak frequency change is less than the threshold as music.

According to the present invention, high performance can be achieved while maintaining a fast response in distinguishing speech from music.

Hereinafter, preferred embodiments of the present invention are described with reference to the drawings. The embodiments illustrated below may, however, be modified in various other forms, and the scope of the present invention is not limited to the embodiments described below.

Frequency analysis of speech and music shows that most musical instruments are designed to sustain sounds at specific frequencies, whereas speech exhibits gradual frequency changes caused by articulation.

The present invention distinguishes speech from music using these frequency change characteristics. That is, the rate of frequency change is used as the feature for distinguishing speech from music.

Below, speech and music are compared on spectrograms in order to analyze the features that distinguish them in the frequency domain.

FIGS. 1A and 1B are graphs in which the highest energy value of each band is marked with a solid line on spectrograms of the simple vowels 'a' (aa) and 'i' (iy).

The highest energy value in each band is called a spectral peak. For 'a' and 'i' in FIGS. 1A and 1B, spectral peaks occur at roughly regular frequency intervals.
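The spectral peak defined above can be illustrated with a short sketch. It is a minimal example rather than the patented implementation: the function name `band_peak_frequencies`, the use of NumPy's real FFT, and the choice to let the last band run slightly past 3,010 Hz are all assumptions made for illustration; only the 270 Hz to 3,010 Hz range and the 125 Hz band width come from the text.

```python
import numpy as np

def band_peak_frequencies(magnitude, sample_rate, n_fft,
                          f_lo=270.0, f_hi=3010.0, band_width=125.0):
    """Return the frequency (Hz) of the strongest bin in each band.

    magnitude: 1-D magnitude spectrum of one frame, length n_fft // 2 + 1.
    Bands start at f_lo and are band_width wide; the last band may extend
    slightly beyond f_hi (an illustrative choice, not from the source).
    Assumes the FFT is long enough that every band contains at least one bin.
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    lower_edges = np.arange(f_lo, f_hi, band_width)
    peaks = np.empty(len(lower_edges))
    for i, lo in enumerate(lower_edges):
        in_band = np.flatnonzero((freqs >= lo) & (freqs < lo + band_width))
        strongest = in_band[np.argmax(magnitude[in_band])]
        peaks[i] = freqs[strongest]  # spectral peak of this band
    return peaks

# Example: a pure 1 kHz tone sampled at 16 kHz, analysed with a 2048-point FFT.
sr, n_fft = 16_000, 2048
tone = np.sin(2 * np.pi * 1000 * np.arange(n_fft) / sr)
peaks = band_peak_frequencies(np.abs(np.fft.rfft(tone, n_fft)), sr, n_fft)
```

Applying the function to every frame of a spectrogram gives a per-band peak track f(t, b) of the kind drawn as solid lines in the figures.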

For 'a' in FIG. 1A, the peaks change relatively little in the sustained middle portion of the phoneme, while large changes occur where the phoneme begins and, in particular, where it ends. This is because the speaker held a steady frequency in the middle of the phoneme but moved the articulators before and after it, transitioning to other frequencies.

For 'i' in FIG. 1B, although it is a simple vowel, the peaks change throughout the utterance of the phoneme. In many cases, unless the speaker deliberately holds the frequency, a gradual drift toward higher or lower frequencies occurs easily even while a simple vowel is being uttered. Thus, each time the vocal organs move to produce a phoneme, the frequency at which each band's energy is highest changes gradually. In other words, in speech the harmonics change gradually at every moment. This is because frequency changes in speech arise when the articulators reshape the vocal tract; since a person produces sound while moving the articulators, the frequency changes continuously whenever the sound changes. This differs notably from a musical instrument, which mechanically jumps between discrete frequencies.

FIG. 2 is a graph in which per-band peaks are marked on the spectrogram of a continuous sentence.

The sentence used in FIG. 2 is "She had your dark suit in greasy wash water all year". A continuous sentence shows a variety of frequency changes: the frequency changes gradually in the transition from the current phoneme to the next, and, as with the simple vowels above, also where a phoneme begins or ends.

FIG. 3A is a graph in which spectral peaks are marked on the spectrogram of a guitar piece.

Here the per-band spectral peaks stay constant and then change abruptly. Musical instruments are generally designed to produce sounds at fixed frequencies; a played note holds a particular frequency for some time and then jumps to another frequency when a new note begins. That is, music does not exhibit the gradual frequency change seen in speech.

FIG. 3B is a graph in which spectral peaks are marked on the spectrogram of a relatively fast, loud drum piece.

For percussion instruments such as drums, there is almost no frequency change.

This analysis confirms that music starts at a particular frequency and holds it for a certain time, whereas speech changes slightly and continuously at every moment of the utterance. A method for distinguishing speech from music using this property is described below.

The present invention provides an SMD algorithm based on the STR (Spectral Transition Rate) feature, which distinguishes speech from music using the rate of frequency change.

The peak frequency change is calculated as in Equation 1 below.
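The equation itself does not survive in this text. One reading that is consistent with the definitions given for it (a change between times t-1 and t, set to 0 once it reaches f_max) is sketched below. The signed difference is an assumption; an absolute difference would fit the wording equally well.

```latex
d(t,b) =
\begin{cases}
0, & \lvert f(t,b) - f(t-1,b) \rvert \ge f_{\max} \\
f(t,b) - f(t-1,b), & \text{otherwise}
\end{cases}
```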

Here d(t, b) is the frequency change of band b at time t, f(t, b) is the peak frequency of band b at time t, and f_max is the change limit width. The instantaneous frequency change is the change in frequency between times t-1 and t. If the frequency change is f_max or more, it is preferable to regard this as a new sound starting at a different frequency and to set d(t, b) to 0. In addition, when a band's average energy is at or below a certain ratio of the overall average energy, d(t, b) may also be set to 0 so that changes in frequency bands with relatively low energy are excluded.
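The two zeroing rules just described can be sketched as follows. This is an illustrative reading, not the patented code: the function name `instantaneous_change`, the signed difference, the default `energy_ratio` of 0.1, and the use of the mean of the band energies as the "overall average energy" are all assumptions.

```python
import numpy as np

def instantaneous_change(peaks, f_max=70.3, band_energy=None, energy_ratio=0.1):
    """Frame-to-frame change of per-band peak frequencies.

    peaks: array of shape (n_frames, n_bands), peak frequency f(t, b) in Hz.
    band_energy: optional array of shape (n_bands,), average energy per band.
    Returns d with the same shape as peaks; d[0] is 0 because no earlier frame exists.
    """
    peaks = np.asarray(peaks, dtype=float)
    d = np.zeros_like(peaks)
    d[1:] = peaks[1:] - peaks[:-1]       # signed change between t-1 and t (assumed)
    d[np.abs(d) >= f_max] = 0.0          # large jump: treated as a new sound
    if band_energy is not None:
        band_energy = np.asarray(band_energy, dtype=float)
        # Overall average energy is approximated by the mean over bands (assumed).
        weak = band_energy <= energy_ratio * band_energy.mean()
        d[:, weak] = 0.0                 # ignore low-energy bands
    return d

tracks = np.array([[500.0, 1000.0],
                   [510.0, 1000.0],
                   [600.0,  990.0],
                   [605.0, 1100.0]])
d_plain = instantaneous_change(tracks)
d_masked = instantaneous_change(tracks, band_energy=np.array([1.0, 0.01]))
```

In the example, the 90 Hz and 110 Hz jumps exceed f_max and are dropped, while the 10 Hz and 5 Hz drifts are kept; with the energy mask, the weak second band contributes nothing.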

Equation 2 gives the peak frequency change over a certain period.
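The equation image is missing here as well. Based on the description that accompanies it (the instantaneous change summed over T for each band, squared, then summed over the effective bands), a plausible form is given below. The alignment of the time window, taken here as the T frames ending at t, is an assumption.

```latex
\mathrm{STR}(t) = \sum_{b=\mathrm{start}}^{\mathrm{end}}
\left( \sum_{\tau = t-T+1}^{t} d(\tau, b) \right)^{2}
```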

Here STR(t) is the amount by which the input sound moves gradually toward a higher or lower frequency band over the time T. It is obtained by summing the instantaneous frequency change d(t, b) over T for each band and then summing the squares of those per-band sums over the effective bands. The bands from start to end are the effective bands in which the frequency change is observed. To build a feature that distinguishes speech from music, changes in the harmonics are tracked in fixed bands. For example, considering the formant frequencies and average pitch, which on average lie between 270 Hz and 3,010 Hz, the frequency range can be divided into 125 Hz bands and the spectral peak of each band tracked.
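A minimal sketch of this sum-then-square computation follows. The function name `spectral_transition_rate`, the trailing window (the T frames ending at t, shortened at the start of the signal), and measuring T in frames are assumptions for illustration.

```python
import numpy as np

def spectral_transition_rate(d, T):
    """STR per frame from instantaneous changes d of shape (n_frames, n_bands).

    For each frame t, d is summed over frames t-T+1..t within each band
    (fewer frames near the start), the per-band sums are squared, and the
    squares are added over all bands.
    """
    d = np.asarray(d, dtype=float)
    n_frames, n_bands = d.shape
    csum = np.vstack([np.zeros((1, n_bands)), np.cumsum(d, axis=0)])
    t = np.arange(n_frames)
    start = np.maximum(t - T + 1, 0)
    drift = csum[t + 1] - csum[start]   # net per-band drift over the window
    return np.sum(drift ** 2, axis=1)

steady = spectral_transition_rate(np.full((5, 2), 2.0), T=3)
jitter = spectral_transition_rate(np.array([[5.0], [-5.0], [5.0], [-5.0]]), T=2)
```

Because the sum is taken before squaring, a steady drift in one direction accumulates (the `steady` case grows to 72), while alternating changes cancel (the `jitter` case falls to 0 after the first frame). This holds only under the signed-difference reading of d(t, b).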

The STR(t) of speech is large at moments when the sound changes but may be small in intervals where the sound is sustained. To compensate, the average over a certain interval is used. Equation 3 gives the final STR value used in the SMD algorithm.
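With the equation image missing, the averaging can be written as a plain mean of STR over a window of size W. The trailing alignment of the window is an assumption; the source only states that an average over the window is taken.

```latex
\overline{\mathrm{STR}}(t) = \frac{1}{W} \sum_{w=0}^{W-1} \mathrm{STR}(t-w)
```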

Here W is the size of the averaging window.
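The averaging step can be sketched as a trailing mean. The function name `smoothed_str` and the handling of the first frames, where fewer than W values exist and the mean is taken over what is available, are illustrative assumptions.

```python
import numpy as np

def smoothed_str(str_values, W):
    """Trailing mean of STR over W frames (over fewer frames at the start)."""
    str_values = np.asarray(str_values, dtype=float)
    csum = np.concatenate([[0.0], np.cumsum(str_values)])
    t = np.arange(len(str_values))
    start = np.maximum(t - W + 1, 0)
    return (csum[t + 1] - csum[start]) / (t + 1 - start)

averaged = smoothed_str([4.0, 0.0, 0.0, 8.0], W=2)
```

A larger W smooths over sustained stretches of speech at the cost of a slower response, which is the trade-off the experiments later examine.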

FIGS. 4A and 4B are graphs showing the STR values for the simple vowels 'a' (aa) and 'i' (iy) of FIGS. 1A and 1B, respectively.

For the vowel 'a', the frequency change is large at the beginning and end of the utterance, so the STR value is large at the start and at the end. For the vowel 'i', the frequency drifts gradually lower during the utterance; the frequency change is large early in the utterance, so the STR value is also large in the first half.

FIG. 5 is a graph showing the STR values of the continuous sentence.

For speech in a continuous sentence, the STR value is large at every moment because of the many frequency transitions.

FIGS. 6A and 6B are graphs showing the STR values of the guitar piece of FIG. 3A and the drum piece of FIG. 3B, respectively.

Both the guitar and drum pieces exhibit less frequency change than speech, so their STR values are relatively small compared with speech.

In the method for distinguishing speech from music according to one embodiment of the present invention, test data is classified as speech if its STR value is greater than or equal to the threshold, and as music if it is smaller.

FIG. 7 is a flowchart of a method for distinguishing speech from music according to one embodiment of the present invention.

First, the input signal is divided into a plurality of bands (S710). The input signal may be a signal in which speech and music are mixed.

Next, the average of the peak frequency change over a predetermined interval is calculated for each band (S720). Because this step (S720) can continue until the input signal ends, it amounts to tracking the peak frequency change.

Next, the signal of a band whose calculated peak frequency change is greater than or equal to a predetermined threshold is classified as speech (S730, S740), while the signal of a band whose calculated peak frequency change is less than the threshold is classified as music (S730, S745). The threshold may be determined by obtaining the STR distributions of speech and music from training data gathered through repeated testing and training, and choosing the STR value that minimizes the speech/music classification error.
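The threshold selection described above can be sketched as a search over candidate values. This is a simple illustration under stated assumptions: the function names `choose_threshold` and `classify` are hypothetical, candidates are limited to the observed training values, both training sets are assumed non-empty, and error is counted as the number of misclassified training samples. The example values are invented for the demonstration and are not data from the patent.

```python
import numpy as np

def choose_threshold(speech_str, music_str):
    """Pick the STR threshold with the fewest training errors.

    Decision rule: STR >= threshold is speech, otherwise music.
    Returns (threshold, number_of_errors).
    """
    speech_str = np.asarray(speech_str, dtype=float)
    music_str = np.asarray(music_str, dtype=float)
    candidates = np.unique(np.concatenate([speech_str, music_str]))
    best_thr, best_err = None, np.inf
    for thr in candidates:
        # Speech below the threshold and music at or above it are errors.
        errors = np.sum(speech_str < thr) + np.sum(music_str >= thr)
        if errors < best_err:
            best_thr, best_err = float(thr), errors
    return best_thr, int(best_err)

def classify(str_value, threshold):
    """Apply the decision rule to a single STR value."""
    return "speech" if str_value >= threshold else "music"

# Made-up STR values, for illustration only.
threshold, errors = choose_threshold([5, 6, 7, 9], [1, 2, 3, 6.5])
```

Applied per band, `classify` corresponds to the branch at S730 leading to S740 or S745.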

Finally, it is determined whether the signal input has ended; if not, the above steps are repeated starting from the tracking of the peak frequency change (S720).

FIG. 8 is a block diagram of an apparatus for distinguishing speech from music according to one embodiment of the present invention.

The harmonics change tracking unit (810) divides the input signal into a plurality of bands and calculates, for each band, the average of the peak frequency change over a predetermined interval. The input signal may be a signal in which speech and music are mixed. The harmonics change tracking unit (810) may calculate the peak frequency change periodically, or at given times following a fixed pattern.

The speech/music discrimination unit (820) classifies the signal of a band whose peak frequency change, as calculated by the harmonics change tracking unit (810), is greater than or equal to the threshold as speech, and the signal of a band whose peak frequency change is less than the threshold as music.

A speech signal and a music signal can be generated according to the classification made by the speech/music discrimination unit (820). The speech extraction unit (830) and the music extraction unit (840) may be added or omitted as needed by those skilled in the art. The speech extraction unit (830) generates a speech signal using the band signals classified by the speech/music discrimination unit (820). Likewise, the music extraction unit (840) generates a music signal using the band signals classified by the speech/music discrimination unit (820).

The performance of the apparatus for distinguishing speech from music according to one embodiment of the present invention is evaluated below.

The TIMIT database was used as the speech data for the experiments, and music of various genres was used as the music data. Audio was recorded in mono at 16,000 Hz, and SMD was performed with a Fourier transform window of 128 ms advanced in 10 ms steps. The effective speech frequency band used for the STR calculation was 125 Hz to 2,000 Hz.

In Equation 1, f_max was set to 70.3 Hz, and in Equation 2 the time T for computing the frequency change was 200 ms, the value found optimal by experiment. Thresholds were obtained for each value of the STR averaging window W. To evaluate the performance of STR under fast response, experiments were also run with T = 150 ms and W = 150 ms, which are shorter than the time used for the cepstral distance calculation in the existing MMCD.
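The experimental settings translate into frame-level quantities with simple arithmetic, sketched below. The function name `frame_parameters` is hypothetical, and the sketch assumes that the FFT length equals the 128 ms window (no zero-padding) and that T and W are counted in 10 ms steps; neither is stated explicitly in the text.

```python
def frame_parameters(sample_rate=16_000, window_ms=128, hop_ms=10,
                     f_max_hz=70.3, t_ms=200, w_ms=250):
    """Convert the stated experimental settings into samples, bins, and frames."""
    n_fft = sample_rate * window_ms // 1000   # samples per analysis window
    hop = sample_rate * hop_ms // 1000        # samples between successive frames
    bin_hz = sample_rate / n_fft              # spacing of FFT bins in Hz
    return {
        "n_fft": n_fft,
        "hop": hop,
        "bin_hz": bin_hz,
        "f_max_bins": f_max_hz / bin_hz,      # change limit expressed in bins
        "T_frames": t_ms // hop_ms,           # frames summed in the STR window
        "W_frames": w_ms // hop_ms,           # frames in the averaging window
    }

params = frame_parameters()
print(params)
```

Under these assumptions the window is 2048 samples, the hop is 160 samples, and the bin spacing is 7.8125 Hz. The stated f_max of 70.3 Hz is then very close to nine bins (9 x 7.8125 = 70.3125 Hz), which may be how the value was chosen, though the source does not say so. T = 200 ms corresponds to 20 frames and W = 250 ms to 25 frames.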

FIG. 9 is a graph comparing the SMD performance of MMCD and of STR according to one embodiment of the present invention, for speech and for each music genre, when the averaging window W is 250 ms.

The performance of MMCD varies widely with the type of music. In contrast, STR according to one embodiment of the present invention performs relatively consistently regardless of the type of music.

FIG. 10 is a graph showing the average SMD performance of STR and MMCD.

SMD using STR according to one embodiment of the present invention performs similarly to MMCD when the averaging window W is large. When W is small, that is, under fast response, it performs better than MMCD. Compared with the existing algorithm, the algorithm according to one embodiment of the present invention is thus confirmed to perform well at relatively fast response.

The present invention can be implemented in software. Preferably, a program for causing a computer to execute the method for distinguishing speech from music according to one embodiment of the present invention may be provided recorded on a computer-readable recording medium. When implemented in software, the constituent means of the present invention are code segments that perform the necessary tasks. The program or code segments may be stored on a processor-readable medium or transmitted as a computer data signal combined with a carrier wave over a transmission medium or communication network.

A computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples include ROM, RAM, CD-ROM, DVD±ROM, DVD-RAM, magnetic tape, floppy disks, hard disks, and optical data storage devices. The computer-readable recording medium may also be distributed over network-connected computer devices so that computer-readable code is stored and executed in a distributed manner.

Although the present invention has been described with reference to an embodiment shown in the drawings, this is merely illustrative, and those of ordinary skill in the art will understand that various modifications and variations of the embodiment are possible. Such modifications should be regarded as falling within the technical scope of protection of the present invention. Accordingly, the true technical scope of protection of the present invention should be determined by the technical spirit of the appended claims.

The present invention relates to a method and apparatus for distinguishing speech from music that achieve high performance while maintaining a fast response, and can be applied to recording devices, sound editing devices, data retrieval methods, preprocessing devices for speech recognition systems, and the like.

FIGS. 1A and 1B are graphs in which the highest energy value of each band is marked with a solid line on spectrograms of simple vowel signals.

FIGS. 4A and 4B are graphs showing the STR values for the simple vowels of FIGS. 1A and 1B, respectively.

FIGS. 6A and 6B are graphs showing the STR values of the guitar piece of FIG. 3A and the drum piece of FIG. 3B, respectively.

FIG. 9 is a graph comparing the SMD performance of MMCD and of STR according to one embodiment of the present invention, for speech and for each music genre, when the averaging window W is 250 ms.

Claims

A method for distinguishing speech from music, comprising:

calculating an STR (Spectral Transition Rate) value by summing the peak frequency change of an input signal over a predetermined time; and

classifying the signal as speech when the STR value is greater than or equal to a threshold, and as music when the STR value is less than the threshold,

wherein the peak frequency change is calculated using an instantaneous frequency change computed by differentiating, with respect to time, the change in peak frequency in a specific band.

The method of claim 1,

further comprising extracting, from the signal, the signal classified as speech according to the classification result.

delete

The method of claim 1,

wherein the instantaneous frequency change,

where f(t, b) is the peak frequency of band b at time t and f_max is the change limit width,

has a value of 0 when the change in f(t, b) is f_max or more.

A method for distinguishing speech from music, comprising:

dividing a signal in which speech and music are mixed into a plurality of bands;

calculating, for each band, an STR (Spectral Transition Rate) value by summing the peak frequency change over a predetermined time; and

classifying the signal of a band whose STR value is greater than or equal to a threshold as speech, and the signal of a band whose STR value is less than the threshold as music,

wherein the peak frequency change is calculated using an instantaneous frequency change computed by differentiating, with respect to time, the change in peak frequency in each band.

delete

The method of claim 5,

wherein the plurality of bands are

frequency bands obtained by dividing the range from 270 Hz to 3010 Hz at regular intervals.

delete

The method of claim 5,

wherein the instantaneous frequency change

has a value of 0 when the change in peak frequency is equal to or greater than the change limit width f_max.

The method of claim 5,

wherein calculating the average of the peak frequency change comprises:

calculating an average energy for each band; and

setting to 0 the instantaneous frequency change of any band whose calculated average energy is at or below a predetermined ratio of the overall average energy of the signal.

A computer-readable recording medium on which is recorded a program for executing, on a computer system, the method for distinguishing speech from music according to any one of claims 1, 2, 4, 5, 7, 9, and 10.

An apparatus for distinguishing speech from music, comprising:

a harmonics change tracking unit that divides a signal in which speech and music are mixed into a plurality of bands and calculates, for each band, an STR (Spectral Transition Rate) value by summing the peak frequency change over a predetermined time; and

a speech/music discrimination unit that, for each band, classifies the signal of a band whose STR value is greater than or equal to a threshold as speech and the signal of a band whose STR value is less than the threshold as music,

wherein the peak frequency change is calculated using an instantaneous frequency change computed by differentiating, with respect to time, the change in peak frequency in a specific band.

delete