KR100901371B1

KR100901371B1 - A speech and music classification method for 3GPP2 SMV codec using a support vector machine

Info

Publication number: KR100901371B1
Application number: KR1020080099029A
Authority: KR
Inventors: 장준혁; 김상균
Original assignee: 인하대학교 산학협력단
Priority date: 2008-10-09
Filing date: 2008-10-09
Publication date: 2009-06-05

Abstract

A speech and music classification method for the SMV codec is provided that applies a support vector machine (SVM) to the codec's speech/music classification in order to improve classification performance. Only the feature vectors used in the speech/music classification algorithm of the SMV codec are selectively obtained (S10). The feature vector includes at least one of a moving average energy, a moving average reflection coefficient, and a periodic coefficient. The obtained feature vector is applied to the SVM to find the optimal classification hyperplane between the training data (S20). Speech/music is then classified using that classification hyperplane (S30).

Description

A SPEECH AND MUSIC CLASSIFICATION METHOD FOR 3GPP2 SMV CODEC USING A SUPPORT VECTOR MACHINE

The present invention relates to a speech/music classification method, and more particularly to a speech/music classification method for the Selectable Mode Vocoder (SMV) codec using a support vector machine (SVM).

With recent advances in IT, a wide range of multimedia services on mobile communication devices has come into commercial use, and research on building efficient communication within a limited frequency band is active. To use a limited network effectively, the Selectable Mode Vocoder (SMV), a standard codec of 3GPP2, selects one of four bit rates for each frame according to the characteristics of the input speech signal and encodes the frame at that rate. Assigning an appropriate rate to every frame according to the type of input signal is therefore an important factor in call quality on mobile devices. In particular, because mobile communication today is not limited to speech but must also carry multimedia such as music, photos, and video, research on methods for effectively distinguishing speech from music is active.

Meanwhile, the support vector machine (SVM) has attracted attention as a strong classification method because, unlike conventional learning methods, it can map patterns into a high-dimensional feature space, yields a globally optimal decision boundary, and is based on structural risk minimization, which minimizes the probability of misclassifying data drawn from an unknown probability distribution. In particular, for binary classification of linearly separable data, the SVM selects, among the infinitely many hyperplanes that separate the two classes, the optimal hyperplane that maximizes the margin to the nearest points of each class, so high generalization performance can be expected.

In view of these findings, it is worth attempting to improve speech/music classification performance by applying the SVM to the speech/music classification method of the SMV codec.

The present invention arises from recognition of this need. Its object is to propose a speech and music classification method with improved performance by applying a support vector machine (SVM) to the speech/music classification of the Selectable Mode Vocoder (SMV) codec. Specifically, among the parameters that are extracted automatically during preprocessing in the existing SMV encoder, those with good statistical-learning classification performance are collected and used as feature vectors for the SVM without any additional computation.

To achieve the above object, a speech and music classification method according to a feature of the present invention comprises:

(1) a first step of selectively obtaining only the feature vectors used in the speech/music classification algorithm of the Selectable Mode Vocoder (SMV) codec;

(2) a second step of finding the optimal classification hyperplane between training data by applying the feature vectors obtained in the first step to a support vector machine (SVM), a statistical learning method; and

(3) a third step of classifying speech/music using the optimal classification hyperplane obtained in the second step.

Preferably, in the first step, the feature vector includes at least one of: the moving average energy, the moving average reflection coefficient of noise and silence, the moving average of the partial residual energy, the moving average of the normalized pitch correlation, the periodic coefficient, and the moving average of the music continuity coefficient.

According to the speech/music classification method of the present invention, speech and music classification performance can be greatly improved by applying a support vector machine (SVM) to the speech/music classification of the Selectable Mode Vocoder (SMV) codec. Among the parameters extracted automatically during preprocessing in the existing SMV encoder, those with good statistical-learning classification performance are collected and used as feature vectors for the SVM without any additional computation.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram showing the configuration of a speech/music classification method according to an embodiment of the present invention. As shown in FIG. 1, the method includes a feature vector extraction step (S10), an optimal classification hyperplane extraction step (S20), and a speech/music classification step (S30).

First, in the feature vector extraction step (S10), only the feature vectors used in the speech/music classification algorithm of the Selectable Mode Vocoder (SMV) codec are selectively obtained. The extracted feature vector may include at least one of: the moving average energy, the moving average reflection coefficient of noise and silence, the moving average of the partial residual energy, the moving average of the normalized pitch correlation, the periodic coefficient, and the moving average of the music continuity coefficient.
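Several of the features listed above are moving averages of per-frame quantities. As a minimal sketch of what such a feature could look like, the code below computes a per-frame energy and smooths it with a first-order recursive average. The function names, the recursive form, and the smoothing factor `alpha` are assumptions for illustration only; this document does not give the SMV codec's actual update rules or constants.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def moving_average(values, alpha=0.9):
    """First-order recursive average: m[n] = alpha * m[n-1] + (1 - alpha) * v[n].

    alpha is a placeholder smoothing factor, not a value taken from SMV.
    The first output is initialised to the first input value.
    """
    smoothed = []
    mean = None
    for v in values:
        mean = v if mean is None else alpha * mean + (1.0 - alpha) * v
        smoothed.append(mean)
    return smoothed


if __name__ == "__main__":
    # Three toy frames of four samples each (hypothetical data).
    frames = [[0.0, 0.0, 0.0, 0.0], [1.0, -1.0, 1.0, -1.0], [2.0, 0.0, -2.0, 0.0]]
    energies = [frame_energy(f) for f in frames]
    print(moving_average(energies, alpha=0.5))
```

A feature vector for one frame would then collect the current value of each smoothed quantity into a single list or tuple before it is passed to the classifier.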

Next, in the optimal classification hyperplane extraction step (S20), the feature vectors obtained in step S10 are applied to the support vector machine (SVM), a statistical learning method, to find the optimal classification hyperplane between the training data.

Finally, in the speech/music classification step (S30), speech/music is classified using the optimal classification hyperplane obtained in step S20.

The detailed description first explains how the SVM finds the optimal classification hyperplane between training data using the selected feature vectors, and then describes in detail the speech/music classification method based on that hyperplane.

1. Optimal classification hyperplane

To increase the learning efficiency of the support vector machine (SVM), the optimal hyperplane must be found. Finding the optimal hyperplane is an optimization problem that minimizes the reciprocal of the margin, expressed by Equation 2, subject to the constraints of Equation 1 below.

The two conditions of Equation 1 can be combined into a single condition, as in Equation 3 below.
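The equation images for Equations 1 to 3 are not reproduced in this text. As a sketch, the standard hard-margin SVM formulation that matches the description above is shown below. The notation is assumed rather than taken from the patent: x_i is a training feature vector, y_i in {-1, +1} is its class label, and N is the number of training samples. With the margin equal to 2/||w||, minimizing its reciprocal is equivalent to minimizing ||w||^2 / 2.

```latex
\begin{align}
  % Equation 1 (assumed standard form): one constraint per class
  \mathbf{w}\cdot\mathbf{x}_i + b &\ge +1 \quad \text{for } y_i = +1, &
  \mathbf{w}\cdot\mathbf{x}_i + b &\le -1 \quad \text{for } y_i = -1 \tag{1} \\
  % Equation 2 (assumed standard form): objective
  \min_{\mathbf{w},\,b}\; & \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2} \tag{2} \\
  % Equation 3 (assumed standard form): the two constraints merged
  y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) &\ge 1, \quad i = 1,\dots,N \tag{3}
\end{align}
```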

Given only the normal vector w of the hyperplane and the bias b, which sets the offset of the hyperplane from the origin, the optimal classification hyperplane is fully determined, so the class of every data point can be decided and the width of the margin can be computed. The optimal w and b, which fit all data points and give the widest margin, are obtained from Equation 4 below by combining the objective function and the constraints using Lagrangian optimization with Lagrange multipliers α_i.
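The image for Equation 4 is likewise missing. A sketch of the usual Lagrangian for the hard-margin problem, written with the symbols w, b, and α_i used in the text, is given below; x_i, y_i, and N are assumed notation for the training vectors, their labels in {-1, +1}, and the number of training samples.

```latex
\begin{equation}
  % Equation 4 (assumed standard form): objective plus multiplier-weighted constraints
  L(\mathbf{w}, b, \boldsymbol{\alpha})
    = \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}
    - \sum_{i=1}^{N} \alpha_i \bigl[\, y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 \,\bigr],
  \qquad \alpha_i \ge 0 \tag{4}
\end{equation}
```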

To maximize the margin, the Karush-Kuhn-Tucker (KKT) conditions are applied, and the optimal weight vector and the optimal bias are obtained from Equations 5 and 6 below, respectively.
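Equations 5 and 6 are not reproduced in this text. In the standard derivation, setting the derivatives of the Lagrangian with respect to w and b to zero and applying KKT complementary slackness gives the forms sketched below. The starred symbols and the support-vector index k are assumed notation; the patent's exact expression for the bias may differ (for example, it may average over support vectors).

```latex
\begin{align}
  % Equation 5 (assumed standard form): stationarity in w
  \mathbf{w}^{*} &= \sum_{i=1}^{N} \alpha_i\, y_i\, \mathbf{x}_i,
  \qquad \text{with } \sum_{i=1}^{N} \alpha_i\, y_i = 0 \tag{5} \\
  % Equation 6 (assumed standard form): any support vector k with alpha_k > 0
  b^{*} &= y_k - \mathbf{w}^{*}\cdot\mathbf{x}_k \tag{6}
\end{align}
```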

2. Method of classifying speech/music using the optimal classification hyperplane

(1) Speech/music discriminant function

Given an arbitrary pattern x, the classification result is computed by the discriminant function of Equation 7 below, using the optimal weight vector and the optimal bias obtained from Equations 5 and 6.
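To make the roles of Equations 5 to 7 concrete, the sketch below assumes their standard forms: the weight vector is the sum of α_i y_i x_i, the bias is y_k minus w·x_k for any support vector k, and the discriminant is the sign of w·x + b. The two-point data set, the multiplier values, and all function names are hypothetical and chosen so the arithmetic can be checked by hand; which class maps to +1 or -1 is not specified in this document.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def weight_vector(alphas, labels, points):
    """Assumed form of Equation 5: w = sum_i alpha_i * y_i * x_i."""
    dim = len(points[0])
    w = [0.0] * dim
    for a, y, x in zip(alphas, labels, points):
        for j in range(dim):
            w[j] += a * y * x[j]
    return w


def bias(w, alphas, labels, points, tol=1e-12):
    """Assumed form of Equation 6: b = y_k - w . x_k for a support vector k."""
    for a, y, x in zip(alphas, labels, points):
        if a > tol:  # alpha_k > 0 marks a support vector
            return y - dot(w, x)
    raise ValueError("no support vector with a positive multiplier")


def classify(w, b, x):
    """Assumed form of Equation 7: sign of w . x + b, returned as +1 or -1."""
    return 1 if dot(w, x) + b >= 0 else -1


# Toy, hand-solvable example: one point per class, both are support vectors.
points = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]

w = weight_vector(alphas, labels, points)
b = bias(w, alphas, labels, points)

if __name__ == "__main__":
    print(w, b, classify(w, b, (2.0, 0.5)), classify(w, b, (-3.0, 1.0)))
```

In the toy example both training points satisfy y_i (w·x_i + b) = 1, so they lie exactly on the margin, as support vectors should.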

(2) Kernel function

Most patterns encountered in practice are not cleanly linearly separable, and speech signals are no exception. It is therefore necessary to map the data into a higher-dimensional space with a nonlinear transform and apply linear separation there. Because the distance relationships of the original space should be preserved to some degree in the mapped space, a kernel function is defined in terms of the mapping function as in Equation 8 below.

Table 1 below lists the types of kernel function and the expression for each type.

Table 1. Kernel functions by type of classifier

Type of classifier    Kernel function
Polynomial            [formula image not reproduced]
RBF                   [formula image not reproduced]
Sigmoid               [formula image not reproduced]
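Because the kernel formulas are missing from the table, the sketch below shows the commonly used textbook forms of the three kernel types it names. The parameter names (`degree`, `coef0`, `gamma`, `kappa`) and their default values are assumptions; the patent's own parameterization is not recoverable from this text.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel: (x . z + coef0) ** degree."""
    return (dot(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, z, kappa=1.0, coef0=0.0):
    """Sigmoid kernel: tanh(kappa * (x . z) + coef0)."""
    return math.tanh(kappa * dot(x, z) + coef0)


if __name__ == "__main__":
    x, z = (1.0, 2.0), (3.0, 4.0)
    print(polynomial_kernel(x, z), rbf_kernel(x, z), sigmoid_kernel(x, z))
```

Each function takes two feature vectors and returns a scalar similarity, so any of them can be substituted wherever an inner product between feature vectors appears.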

(3) Final speech/music discriminant function

The discriminant function and the optimization problem can be expressed using only K(·,·), without evaluating Φ(x) explicitly; this way of avoiding the computation is called the kernel trick. The kernel trick is useful only for kernel functions for which a corresponding Φ exists, such as those listed in Table 1. The resulting final discriminant function of the nonlinear SVM is given by Equation 9 below.
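The kernel trick described above can be illustrated in two parts. First, for the homogeneous quadratic kernel on 2-D inputs, evaluating K(x, z) = (x·z)^2 gives the same number as mapping both inputs through an explicit Φ and taking the inner product. Second, the nonlinear discriminant is assumed here to take the usual form sign(Σ α_i y_i K(x_i, x) + b). The support vectors, multipliers, bias, and `gamma` in the example are made-up values for illustration, not trained parameters from the patent.

```python
import math


def quadratic_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: (x . z) ** 2."""
    return sum(a * b for a, b in zip(x, z)) ** 2


def phi(x):
    """Explicit feature map matching quadratic_kernel for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel: exp(-gamma * ||x - z||^2); gamma is a placeholder value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def kernel_decision(x, support_vectors, labels, alphas, b, kernel):
    """Assumed form of Equation 9: sign(sum_i alpha_i y_i K(x_i, x) + b)."""
    score = b + sum(a * y * kernel(sv, x)
                    for a, y, sv in zip(alphas, labels, support_vectors))
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    x, z = (1.0, 2.0), (3.0, 4.0)
    explicit = sum(p * q for p, q in zip(phi(x), phi(z)))
    print(quadratic_kernel(x, z), explicit)  # the two values agree up to rounding

    svs = [(1.0, 1.0), (-1.0, -1.0)]
    print(kernel_decision((2.0, 2.0), svs, [1, -1], [0.25, 0.25], 0.0, rbf_kernel))
```

Note that `kernel_decision` never calls `phi`; only kernel evaluations against the stored support vectors are needed at classification time.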

3. Experimental results

The speech database used for the present invention was the TIMIT database: clean speech of about 6 seconds per file, sampled at 8 kHz, with 10 files spoken by each of 326 male and 138 female speakers. The music database was built by recording music of several genres from CDs through a mobile phone and downsampling it to 8 kHz; music files of about 5 minutes each were used. The model of the proposed speech/music classification algorithm was trained with 4,200 speech files and 60 music files (12 metal, 12 jazz, 12 blues, 12 hip-hop, 12 classical).

Test files were created to evaluate the objective performance of SMV and of the proposed algorithm. To avoid inflated results from reusing data, the speech/music data used for training were not used for testing. Each test file was assembled from 5 speech files (6 to 12 seconds), 5 music files (28 to 32 seconds), and 10 silence segments (3 to 15 seconds).

To check speech/music classification performance across music genres, a total of 84 test files were created in two forms: 60 test files whose music comes from a single genre (hip-hop, metal, jazz, blues, or classical) and 24 test files that mix genres. To measure the actual performance of the two systems, their outputs were compared with reference labels written manually for every 20 ms of each test file: 0 (silence), 1 (speech), or 2 (music).
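The frame-by-frame comparison described above can be sketched as follows. The label codes 0, 1, and 2 come from the text; the function name, the definition of detection probability as the fraction of reference frames of a class that were assigned to that class, and the example label sequences are assumptions for illustration.

```python
SILENCE, SPEECH, MUSIC = 0, 1, 2  # label codes used for each 20 ms frame


def detection_probability(reference, predicted, label):
    """Fraction of frames labelled `label` in the reference that the
    classifier also labelled `label`."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")
    total = sum(1 for r in reference if r == label)
    if total == 0:
        raise ValueError("reference contains no frames with this label")
    hits = sum(1 for r, p in zip(reference, predicted) if r == label and p == label)
    return hits / total


if __name__ == "__main__":
    # Hypothetical ten-frame excerpt: manual labels vs. classifier output.
    reference = [0, 1, 1, 1, 1, 2, 2, 2, 2, 0]
    predicted = [0, 1, 1, 2, 1, 2, 1, 2, 2, 0]
    print(detection_probability(reference, predicted, SPEECH))
    print(detection_probability(reference, predicted, MUSIC))
```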

Table 2 below shows the speech/music detection probability (P_d) of the existing SMV and of the proposed SVM-based algorithm. In the SVM, the P_d values for speech and music can be traded off by changing the threshold, so the operating point can be set as needed. The probability of error (P_e), shown in the rightmost column, is the sum of the miss probabilities (1 - P_d) for speech and music. As Table 2 shows, the proposed method gave better classification overall, with especially large gains for metal, blues, hip-hop, classical, and mixed-genre music. These results confirm that the proposed method clearly outperforms the speech/music classification of the existing SMV codec.

Table 2. Speech/music detection probability (P_d) and probability of error (P_e)

Test      Method     Music   Speech   P_e
Metal     SMV        0.22    0.91     0.44
Metal     Proposed   0.90    0.92     0.09
Blues     SMV        0.15    0.90     0.43
Blues     Proposed   0.90    0.90     0.10
Hiphop    SMV        0.28    0.90     0.37
Hiphop    Proposed   0.66    0.90     0.18
Jazz      SMV        0.27    0.92     0.41
Jazz      Proposed   0.35    0.90     0.38
Classic   SMV        0.50    0.90     0.30
Classic   Proposed   0.81    0.91     0.14
Mixed     SMV        0.21    0.93     0.43
Mixed     Proposed   0.72    0.90     0.19
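As a quick arithmetic check of the P_e column, the sketch below recomputes the sum of the miss probabilities (1 - P_d) from the Music and Speech columns of Table 2. The row values are copied from the table; the function names and the 0.006 rounding tolerance are assumptions made for illustration. In this check, the tabulated P_e is closer to half of the sum than to the sum itself for most rows, which suggests that P_e may be an average over the two classes. The source does not state this, and a few rows do not match either reading, so this is an observation rather than a correction.

```python
# (test, method, P_d music, P_d speech, tabulated P_e), copied from Table 2.
TABLE_2 = [
    ("Metal", "SMV", 0.22, 0.91, 0.44), ("Metal", "Proposed", 0.90, 0.92, 0.09),
    ("Blues", "SMV", 0.15, 0.90, 0.43), ("Blues", "Proposed", 0.90, 0.90, 0.10),
    ("Hiphop", "SMV", 0.28, 0.90, 0.37), ("Hiphop", "Proposed", 0.66, 0.90, 0.18),
    ("Jazz", "SMV", 0.27, 0.92, 0.41), ("Jazz", "Proposed", 0.35, 0.90, 0.38),
    ("Classic", "SMV", 0.50, 0.90, 0.30), ("Classic", "Proposed", 0.81, 0.91, 0.14),
    ("Mixed", "SMV", 0.21, 0.93, 0.43), ("Mixed", "Proposed", 0.72, 0.90, 0.19),
]


def miss_sum(music_pd, speech_pd):
    """Sum of miss probabilities (1 - P_d) over the music and speech classes."""
    return (1.0 - music_pd) + (1.0 - speech_pd)


def rows_matching_half_sum(rows, tol=0.006):
    """Rows whose tabulated P_e equals half the miss-probability sum within tol."""
    matched = []
    for test, method, music_pd, speech_pd, pe in rows:
        if abs(miss_sum(music_pd, speech_pd) / 2.0 - pe) <= tol:
            matched.append((test, method))
    return matched


if __name__ == "__main__":
    for test, method, music_pd, speech_pd, pe in TABLE_2:
        total = miss_sum(music_pd, speech_pd)
        print(f"{test:8s} {method:9s} sum={total:.3f} half={total / 2:.3f} table={pe:.2f}")
```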

The present invention described above may be modified or applied in various ways by those of ordinary skill in the art to which it pertains, and the scope of its technical idea is to be defined by the claims below.

FIG. 1 is a diagram showing the configuration of a speech/music classification method according to an embodiment of the present invention.

<Explanation of reference symbols for the main parts of the drawings>

S10: feature vector extraction step

S20: optimal classification hyperplane extraction step

S30: speech/music classification step

Claims

1. A speech and music classification method for a Selectable Mode Vocoder (SMV) codec using a support vector machine, comprising:

(1) a first step of selectively obtaining only the feature vectors used in the speech/music classification algorithm of the Selectable Mode Vocoder (SMV) codec;

(2) a second step of finding the optimal classification hyperplane between training data by applying the feature vectors obtained in the first step to a support vector machine (SVM), a statistical learning method; and

(3) a third step of classifying speech/music using the optimal classification hyperplane obtained in the second step.

2. The method of claim 1, wherein, in the first step, the feature vector includes at least one of: a moving average energy, a moving average reflection coefficient of noise and silence, a moving average of partial residual energy, a moving average of normalized pitch correlation, a periodic coefficient, and a moving average of a music continuity coefficient.