KR100933946B1

KR100933946B1 - Feature vector extraction method using adaptive selection of frame shift and speaker recognition system thereof

Info

Publication number: KR100933946B1
Application number: KR1020070108929A
Authority: KR
Inventors: 최정윤; 강홍구; 정치상
Original assignee: 연세대학교 산학협력단
Priority date: 2007-10-29
Filing date: 2007-10-29
Publication date: 2009-12-28
Also published as: KR20090043203A

Abstract

The present invention relates to a feature vector extraction method using variable selection of the overlap length of speech analysis sections, and to a speaker recognition system using the method. For an input speech signal, a current reference analysis section and M candidate analysis sections are set and a feature vector is extracted from each section. The candidate analysis section whose feature vector has the smallest spectral difference from that of the current reference analysis section is selected, its feature vector is extracted, and the length between the selected candidate analysis section and the current reference analysis section is set as the overlap length, after which a new reference analysis section and M new candidate analysis sections are set. This process is repeated, so that the overlap length of the speech analysis sections is selected variably while feature vectors are extracted for the input speech signal.

When the overlap length of the speech analysis sections is selected variably according to the present invention, the various characteristics of the speaker in the input speech signal can be extracted sufficiently, and all of the speaker characteristics contained in the input speech can be modeled easily. Speaker recognition performance can therefore be improved considerably compared with a speaker recognition system that uses a feature vector extraction method in which the overlap length of the speech analysis sections is fixed to a specific value.

Speaker recognition, speech analysis, overlap, feature vector, speaker model

Description

Feature vector extraction using variable selection of overlapping length of speech analysis section and speaker recognition system using same

The present invention relates to speaker recognition technology, and more particularly to a method of extracting feature vectors by appropriately selecting the overlap length of speech analysis sections, and to a speaker recognition system that recognizes a speaker using this feature vector extraction method.

In general, a speaker recognition system can be divided into three main parts: a part that extracts speaker-dependent feature vectors, a speaker modeling part that generates speaker-dependent models, and a speaker recognition part that uses the generated speaker-dependent models to recognize the speaker under test.

In the feature vector extraction part, Mel Frequency Cepstral Coefficients (MFCC), Linear Frequency Cepstral Coefficients (LFCC), Linear Prediction Cepstral Coefficients (LPCC), Line Spectral Frequencies (LSF), and the like are used as feature vectors.

Among the methods of setting the speech analysis section used for feature vector extraction, the most common is the fixed speech analysis section method. It uses an analysis length of 20 ms to 30 ms with sections overlapping by about 50%, a choice made in consideration of the computational complexity of the speaker recognition system.

The operation of a conventional speaker recognition system is briefly described as follows.

FIG. 1 is a block diagram of a conventional speaker recognition system that extracts feature vectors using the fixed speech analysis section method; the system consists of a feature vector extraction unit (100), a speaker model generation unit (110), and a speaker recognition unit (120). FIG. 2 shows an example of the conventional feature vector extraction method using the fixed speech analysis section method.

The feature vector extraction unit (100) sets the length (L) of the speech analysis section in a preset unit (for example, 20 ms to 30 ms) for the experimental input speech signal or the input speech signal for actual speaker recognition, and extracts a feature vector from each speech analysis section while shifting the section with a fixed overlap of 50% of the length of the previous section.

For example, FIG. 2 shows the extraction of a plurality of feature vectors (for example, three) from an initial speech analysis section (current frame) of length L and two further speech analysis sections (next frames) formed by shifting sequentially along the input speech signal with an overlap fixed at 50% of the length of the initial section.
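
The fixed-overlap framing described above can be sketched as follows. This is a minimal illustration, not code from the patent: the function name `fixed_frame_starts` and the 8 kHz sampling rate used in the example are assumptions made here.

```python
def fixed_frame_starts(num_samples, frame_len, overlap_ratio=0.5):
    """Return the start indices of fixed-length frames with a fixed overlap ratio."""
    if not 0.0 <= overlap_ratio < 1.0:
        raise ValueError("overlap_ratio must be in [0, 1)")
    # The frame shift is the part of the frame that does not overlap its predecessor.
    shift = max(1, int(round(frame_len * (1.0 - overlap_ratio))))
    # Keep only frames that fit entirely inside the signal.
    return list(range(0, num_samples - frame_len + 1, shift))


# Example (assumed 8 kHz sampling): 40 ms of speech = 320 samples, L = 20 ms = 160 samples.
starts = fixed_frame_starts(num_samples=320, frame_len=160)
print(starts)  # [0, 80, 160]: one current frame and two next frames, as in FIG. 2
```

With a 50% overlap every frame advances by exactly L/2 regardless of what the signal contains, which is the limitation the invention addresses.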

Once the feature vectors have been extracted in this way, the speaker model generation unit (110) uses the feature vectors extracted from the experimental input speech signal to build a speaker model for each speaker, for example as a Gaussian mixture model (GMM).

With a speaker model built for each speaker from the feature vectors of the experimental input speech signal, when the feature vector extraction unit (100) extracts feature vectors from an input speech signal for actual speaker recognition, the speaker recognition unit (120) matches those feature vectors against each speaker model generated by the speaker model generation unit (110), calculates the probability of each speaker, and identifies the speaker whose model yields the highest probability as the speaker of the input speech.
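
The decision rule in this paragraph (score the test feature vectors against every speaker model and pick the highest) can be sketched as below. The sketch makes several assumptions not stated in the text: the GMMs have diagonal covariances, each model is stored as a list of `(weight, means, variances)` tuples, the models are already trained, and log-likelihoods are compared instead of raw probabilities (the ranking is the same). The names `gmm_log_likelihood` and `identify_speaker` are hypothetical.

```python
import math


def gmm_log_likelihood(frames, gmm):
    """Sum of per-frame log-likelihoods under a diagonal-covariance GMM.

    gmm: list of (weight, means, variances) tuples (assumed layout).
    """
    total = 0.0
    for x in frames:
        comp_logs = []
        for weight, means, variances in gmm:
            log_p = math.log(weight)
            for xi, mu, var in zip(x, means, variances):
                log_p += -0.5 * (math.log(2.0 * math.pi * var) + (xi - mu) ** 2 / var)
            comp_logs.append(log_p)
        # log-sum-exp over the mixture components for numerical stability
        peak = max(comp_logs)
        total += peak + math.log(sum(math.exp(v - peak) for v in comp_logs))
    return total


def identify_speaker(frames, speaker_models):
    """Return the name of the speaker model with the highest log-likelihood."""
    return max(speaker_models,
               key=lambda name: gmm_log_likelihood(frames, speaker_models[name]))


# Toy models with two-dimensional features (illustrative values only).
models = {
    "speaker_a": [(1.0, [0.0, 0.0], [1.0, 1.0])],
    "speaker_b": [(0.5, [3.0, 3.0], [1.0, 1.0]), (0.5, [5.0, 5.0], [1.0, 1.0])],
}
print(identify_speaker([[2.9, 3.2], [4.8, 5.1]], models))  # speaker_b
```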

Meanwhile, the characteristics of a speech signal vary widely with the type of sound, such as unvoiced or voiced sounds, so the speech analysis sections should vary accordingly. The existing method of fixing the overlap length of the speech analysis sections to a specific value (for example, 50%) cannot take these variations into account.

Indeed, when feature vectors are extracted as in the conventional speaker recognition system described briefly above with reference to FIGS. 1 and 2, with the overlap length fixed at about 50% of the length of the previous speech analysis section in consideration of computational complexity alone and without regard to the type of sound, it is difficult to extract the speaker's characteristics sufficiently.

Moreover, a speech signal shows different characteristics in unvoiced sounds, voiced sounds, phoneme transition regions, and so on. If a fixed overlap length is used without regard to the type of sound, it becomes difficult to model all of the speaker characteristics contained in the various types of sound.

The present invention is intended to solve the conventional problems described above. An object of the present invention is to provide a feature vector extraction method using variable selection of the overlap length of speech analysis sections. In this method, for an experimental input speech signal or an input speech signal for actual speaker recognition, a reference analysis section having a preset unit length (L) is set together with M candidate analysis sections obtained by shifting from the reference analysis section in steps of a preset unit interval (r), and a feature vector is extracted from each section. The spectral differences between the feature vector of the current reference analysis section and the feature vector of each candidate analysis section are analyzed, and the candidate analysis section with the smallest spectral difference from the current reference analysis section is selected. The feature vector of the selected candidate analysis section is extracted and, at the same time, the length between the selected candidate analysis section and the current reference analysis section is set as the overlap length, and the current reference analysis section is moved to the selected candidate analysis section so that a new reference analysis section and M new candidate analysis sections are set. This process is repeated until M candidate analysis sections can no longer be set within the length of the input speech signal, so that the overlap length is selected variably while feature vectors are extracted for the input speech signal.

Another object of the present invention is to provide a speaker recognition system using this feature vector extraction method. One speaker model is built for each speaker from feature vectors extracted by the method. Feature vectors extracted by the same method from an input speech signal for actual speaker recognition are then matched against each speaker model to calculate the probability of each speaker, and the speaker whose model yields the highest probability is identified as the speaker of the input speech.

To achieve the above object, a speaker recognition system using the feature vector extraction method with variable selection of the overlap length of speech analysis sections according to an embodiment of the present invention comprises: a feature vector extraction unit that, for an experimental input speech signal or an input speech signal for actual speaker recognition, sets a reference analysis section having a preset unit length (L) and M candidate analysis sections obtained by shifting from the reference analysis section in steps of a preset unit interval (r), and extracts a feature vector from each section; an analysis section selection unit that analyzes the spectral differences between the feature vector of the current reference analysis section and the feature vector of each candidate analysis section extracted by the feature vector extraction unit, and selects the candidate analysis section with the smallest spectral difference from the current reference analysis section; an overlap length varying unit that extracts the feature vector of the selected candidate analysis section and, at the same time, sets the length between the selected candidate analysis section and the current reference analysis section as the overlap length and moves the current reference analysis section to the selected candidate analysis section, so that the feature vector extraction unit sets a new reference analysis section and M new candidate analysis sections, repeating this process until M candidate analysis sections can no longer be set within the length of the input speech signal and thereby extracting feature vectors for the input speech signal while variably selecting the overlap length; a speaker model generation unit that builds a speaker model for each speaker using the feature vectors extracted by the overlap length varying unit from the experimental input speech signal; and a speaker recognition unit that, once the speaker models have been built and the overlap length varying unit has extracted feature vectors from an input speech signal for actual speaker recognition, matches those feature vectors against each speaker model generated by the speaker model generation unit, calculates the probability of each speaker, and identifies the speaker whose model yields the highest probability as the speaker of the input speech.

To achieve the above object, a feature vector extraction method using variable selection of the overlap length of speech analysis sections according to an embodiment of the present invention comprises: a step of setting, for an experimental input speech signal or an input speech signal for actual speaker recognition, a reference analysis section having a preset unit length (L) and M candidate analysis sections obtained by shifting from the reference analysis section in steps of a preset unit interval (r), and extracting a feature vector from each section; a step of analyzing the spectral differences between the extracted feature vector of the current reference analysis section and the feature vector of each candidate analysis section, and selecting the candidate analysis section with the smallest spectral difference from the current reference analysis section; and a step of extracting the feature vector of the selected candidate analysis section and, at the same time, setting the length between the selected candidate analysis section and the current reference analysis section as the overlap length and moving the current reference analysis section to the selected candidate analysis section so that a new reference analysis section and M new candidate analysis sections are set, repeating this process until M candidate analysis sections can no longer be set within the length of the input speech signal and thereby extracting feature vectors for the input speech signal while variably selecting the overlap length.

When the overlap length of the speech analysis sections is selected variably according to the present invention, the various characteristics of the speaker in the input speech signal can be extracted sufficiently, and all of the speaker characteristics contained in the input speech can be modeled easily. Speaker recognition performance can therefore be improved considerably compared with a speaker recognition system that uses a feature vector extraction method in which the overlap length of the speech analysis sections is fixed to a specific value (for example, 50%).

Hereinafter, embodiments of the present invention are described in more detail with reference to the accompanying drawings.

Referring to FIG. 3, the feature vector extraction unit (200) sets, for an experimental input speech signal or an input speech signal for actual speaker recognition, a reference analysis section having a preset unit length (L) and M candidate analysis sections obtained by shifting from the reference analysis section in steps of a preset unit interval (r), and extracts a feature vector from each section.

The analysis section selection unit (210) analyzes the spectral differences between the feature vector of the current reference analysis section and the feature vector of each candidate analysis section extracted by the feature vector extraction unit (200), and selects the candidate analysis section with the smallest spectral difference from the current reference analysis section.

The overlap length varying unit (220) extracts the feature vector of the candidate analysis section selected by the analysis section selection unit (210) and, at the same time, sets the length between the selected candidate analysis section and the current reference analysis section as the overlap length and moves the current reference analysis section to the selected candidate analysis section, so that the feature vector extraction unit (200) sets a new reference analysis section and M new candidate analysis sections. It repeats this process until M candidate analysis sections can no longer be set within the length of the input speech signal, thereby extracting feature vectors for the experimental input speech signal or the input speech signal for actual speaker recognition while variably selecting the overlap length.

The speaker model generation unit (230) builds a speaker model for each speaker, for example as a Gaussian mixture model (GMM), using the feature vectors extracted by the overlap length varying unit (220) from the experimental input speech signal.

Once a speaker model has been built for each speaker, and the overlap length varying unit (220) has extracted feature vectors from an input speech signal for actual speaker recognition, the speaker recognition unit (240) matches those feature vectors against each speaker model generated by the speaker model generation unit (230), calculates the probability of each speaker, and identifies the speaker whose model yields the highest probability as the speaker of the input speech.

The speaker recognition system configured as described above, using the feature vector extraction method with variable selection of the overlap length of speech analysis sections according to an embodiment of the present invention, operates as follows.

First, the feature vector extraction method using variable selection of the overlap length of speech analysis sections according to the present invention is described in detail with reference to FIG. 4.

When an experimental input speech signal or an input speech signal for actual speaker recognition is input (S100), the feature vector extraction unit (200) sets a reference analysis section having a preset unit length L = 20 ms and, by shifting from the reference analysis section in steps of a preset unit interval r = 1 ms, sets M = 19 candidate analysis sections for that one reference analysis section, and extracts a feature vector from each section (S110, S120).
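
As a small illustration of this layout, the sketch below lists where the M = 19 candidate sections start relative to one reference section. The sampling rate of 8 kHz and the function name `candidate_starts` are assumptions made here; the text gives only L, r, and M.

```python
def candidate_starts(ref_start, sample_rate, step_ms=1.0, num_candidates=19):
    """Start indices (in samples) of the M candidate sections for one reference section."""
    step = int(round(sample_rate * step_ms / 1000.0))  # r expressed in samples
    return [ref_start + j * step for j in range(1, num_candidates + 1)]


sample_rate = 8000                      # assumed; not stated in the text
frame_len = 20 * sample_rate // 1000    # L = 20 ms -> 160 samples
starts = candidate_starts(0, sample_rate)
print(len(starts), starts[0], starts[-1])  # 19 8 152
```

With L = 20 ms and r = 1 ms, the j-th candidate is shifted by j ms, so it overlaps the reference section by (20 - j) ms, that is, between 19 ms and 1 ms.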

Next, the analysis section selection unit (210) analyzes the spectral differences between the feature vector of the current reference analysis section and the feature vector of each candidate analysis section extracted by the feature vector extraction unit (200), and selects the candidate analysis section with the smallest spectral difference from the current reference analysis section (S130).

Here, the analysis section selection unit (210) uses Equation 1 below to analyze the spectral distance between the extracted feature vector of the current reference analysis section and the feature vector of each candidate analysis section, and selects the candidate analysis section with the smallest spectral distance from the current reference analysis section.
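
The image of Equation 1 is not present in this text. A form consistent with the surrounding description, offered as an assumed reconstruction and not as the patent's exact notation, is given below; the symbols $\hat{j}$, $d(\cdot,\cdot)$, $\mathbf{c}_r$, and $\mathbf{c}_c^{(j)}$ are chosen here for illustration.

```latex
\hat{j} \;=\; \operatorname*{arg\,min}_{j \in \{1,\dots,M\}} \; d\!\left(\mathbf{c}_r,\ \mathbf{c}_c^{(j)}\right)
```

Here $\mathbf{c}_r$ is the feature vector of the reference analysis section and $\mathbf{c}_c^{(j)}$ is the feature vector of the j-th candidate analysis section.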

In Equation 1, j ranges over the set of candidate analysis sections, and the distance term denotes the spectral distance between the feature vector of the reference analysis section and the feature vector of a candidate analysis section.

In addition, the analysis section selection unit (210) uses any one of the Euclidean cepstral distance, the Mahalanobis distance, and the weighted cepstral distance as the spectral distance between the feature vector of the reference analysis section and the feature vector of the candidate analysis section.

For example, the Euclidean cepstral distance (d_ECEP) applied in an embodiment of the present invention can be obtained by Equation 2 below.
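
The image of Equation 2 is not present in this text. The usual squared form of the Euclidean cepstral distance, which matches the symbols defined for Equation 2, is shown here as an assumption; the original may instead take a square root, which would not change which candidate attains the minimum.

```latex
d_{ECEP} \;=\; \sum_{i=1}^{p} \bigl(c_r(i) - c_c(i)\bigr)^{2}
```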

In Equation 2, p denotes the order of the cepstrum, and c_r(i) and c_c(i) denote the i-th order cepstral coefficients of the reference analysis section and the candidate analysis section, respectively.

The Mahalanobis distance (d_MCEP) applied in an embodiment of the present invention can be obtained by Equation 3 below, which equalizes the variances of coefficients of different orders.
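
The image of Equation 3 is not present in this text. The standard Mahalanobis form, written with the symbols C_r, C_c, and V that the text defines for Equation 3, is shown here as an assumed reconstruction.

```latex
d_{MCEP} \;=\; \left(C_r - C_c\right)^{\mathsf{T}} V^{-1} \left(C_r - C_c\right)
```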

In Equation 3, C_r and C_c are the column vectors of the cepstral coefficients of the reference analysis section and the candidate analysis section, respectively, and V is the covariance matrix of the cepstral coefficients of the reference analysis section and the candidate analysis section.

Because using the full covariance matrix as in Equation 3 raises a computational complexity problem, the weighted cepstral distance (d_WCEP) applied in an embodiment of the present invention uses only the diagonal values of the covariance matrix V, that is, the variance of each order of the cepstral coefficients, and can be obtained by Equation 4 below.
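
The image of Equation 4 is not present in this text. A form consistent with the definitions given for Equation 4, shown as an assumed reconstruction, weights each squared coefficient difference by the inverse of the corresponding variance:

```latex
d_{WCEP} \;=\; \sum_{i=1}^{p} w(i)\,\bigl(c_r(i) - c_c(i)\bigr)^{2},
\qquad w(i) = \frac{1}{V_{ii}}
```

Under this form the weighted distance equals the Mahalanobis distance whenever V is diagonal, and reduces to the squared Euclidean cepstral distance when every w(i) is 1.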

In Equation 4, w(i) denotes the reciprocal of the i-th diagonal value of V, the covariance matrix of c_r(i) and c_c(i), and c_r(i) and c_c(i) denote the i-th order cepstral coefficients of the reference analysis section and the candidate analysis section, respectively.

The above equations for the cepstral distance between speech analysis sections are merely one embodiment for obtaining the spectral difference; various other equations may be applied to find the minimum spectral difference.
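
The three distance measures named above can be sketched in a few lines. Because the equation images are missing from this text, the functions below use the common squared forms as an assumption; the function names are hypothetical, and `v_inv` is assumed to be a precomputed inverse covariance matrix supplied by the caller.

```python
def euclidean_cepstral(c_r, c_c):
    """Squared Euclidean distance between two cepstral vectors."""
    return sum((a - b) ** 2 for a, b in zip(c_r, c_c))


def mahalanobis_cepstral(c_r, c_c, v_inv):
    """(C_r - C_c)^T V^{-1} (C_r - C_c); v_inv is V^{-1} given as a list of rows."""
    diff = [a - b for a, b in zip(c_r, c_c)]
    n = len(diff)
    return sum(diff[i] * v_inv[i][k] * diff[k] for i in range(n) for k in range(n))


def weighted_cepstral(c_r, c_c, variances):
    """Squared differences weighted by w(i) = 1 / variance of the i-th coefficient."""
    return sum((a - b) ** 2 / var for a, b, var in zip(c_r, c_c, variances))


c_r, c_c = [1.0, 2.0], [0.0, 0.0]
print(euclidean_cepstral(c_r, c_c))                               # 5.0
print(weighted_cepstral(c_r, c_c, [1.0, 4.0]))                    # 2.0
print(mahalanobis_cepstral(c_r, c_c, [[1.0, 0.0], [0.0, 0.25]]))  # 2.0
```

The last two results agree because the example covariance matrix is diagonal, which is the case in which the weighted distance avoids the cost of the full matrix product.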

When the analysis section selection unit (210) has selected the candidate analysis section with the smallest spectral distance from the current reference analysis section, the overlap length varying unit (220) extracts the feature vector of the selected candidate analysis section (S140) and, at the same time, sets the length between the selected candidate analysis section and the current reference analysis section as the overlap length and moves the current reference analysis section to the selected candidate analysis section, so that a new reference analysis section and M = 19 new candidate analysis sections are set (S150). Steps S110 to S150 are then repeated until M = 19 candidate analysis sections can no longer be set within the length of the input speech signal, so that feature vectors are extracted for the experimental input speech signal or the input speech signal for actual speaker recognition while the overlap length is selected variably (S160).

If, while steps S110 to S150 are being repeated and the feature vector of the selected candidate analysis section is being extracted on each pass, M = 19 candidate analysis sections can no longer be set within the length of the input speech signal, the overlap length varying unit (220) completes the feature vector extraction (S170).
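
Steps S110 to S170 can be sketched as a single loop. This is an illustrative reading of the procedure, not the patent's implementation: the function name `adaptive_frame_features` is hypothetical, the feature extractor and distance are passed in as callables, ties are resolved in favor of the smallest shift, and including the first reference section's feature vector in the output is an assumption.

```python
def adaptive_frame_features(signal, frame_len, step, num_candidates, extract, distance):
    """Extract feature vectors while choosing each frame shift from M candidates.

    Returns (features, shifts); the overlap of each chosen frame is frame_len - shift.
    """
    if len(signal) < frame_len:
        return [], []
    ref_start = 0
    ref_feat = extract(signal[ref_start:ref_start + frame_len])
    features = [ref_feat]  # assumption: the first reference section is also kept
    shifts = []
    # Stop once the M candidate sections no longer fit inside the signal (S170).
    while ref_start + num_candidates * step + frame_len <= len(signal):
        best_j, best_feat, best_dist = None, None, None
        for j in range(1, num_candidates + 1):          # S110, S120
            start = ref_start + j * step
            feat = extract(signal[start:start + frame_len])
            dist = distance(ref_feat, feat)
            if best_dist is None or dist < best_dist:   # S130: minimum spectral distance
                best_j, best_feat, best_dist = j, feat, dist
        features.append(best_feat)                      # S140
        shifts.append(best_j * step)
        ref_start += best_j * step                      # S150: selected candidate becomes the reference
        ref_feat = best_feat
    return features, shifts


# Toy run: the "feature" is the frame mean and the distance is a squared difference.
toy_signal = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
feats, shifts = adaptive_frame_features(
    toy_signal, frame_len=4, step=1, num_candidates=3,
    extract=lambda frame: [sum(frame) / len(frame)],
    distance=lambda a, b: (a[0] - b[0]) ** 2,
)
print(shifts)  # [1, 1, 1, 2, 1]
```

In a real system `extract` would compute cepstral coefficients such as MFCC or LPCC, and `distance` would be one of the cepstral distances described above; with L = 20 ms, r = 1 ms, and M = 19 the arguments would be the corresponding sample counts.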

When the overlap length varying unit (220) has completed the feature vector extraction in this way, the speaker model generation unit (230) uses the feature vectors extracted by the above method from the experimental input speech signal to build a speaker model for each speaker, for example as a Gaussian mixture model (GMM).

With the speaker models for the individual speakers thus built by the speaker model generation unit 230 from the feature vectors of the experimental input speech signal, when the overlap length variable unit 220 extracts feature vectors from an input speech signal for actual speaker recognition, the speaker recognition unit 240 applies those feature vectors to each speaker model generated by the speaker model generation unit 230 and calculates, for each model, the probability that the input is recognized as that speaker. The speaker whose model yields the largest probability value is determined to be the speaker of the input speech.
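The recognition step can be sketched as scoring the extracted feature vectors against each speaker model and taking the maximum. This is an illustration under assumptions, not the system's actual code: the models are assumed to be already trained, diagonal covariances are assumed, the per-frame scores are summed as log-likelihoods (a common stand-in for the "probability value"), and the names `gmm_log_likelihood` and `identify_speaker` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# One mixture component: (weight, mean vector, variance vector).
# Diagonal covariance is an assumption made for this sketch.
Component = Tuple[float, Sequence[float], Sequence[float]]
GMM = List[Component]


def log_gaussian_diag(x: Sequence[float], mean: Sequence[float],
                      var: Sequence[float]) -> float:
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x: Sequence[float], gmm: GMM) -> float:
    """Log of the weighted sum of component densities (log-sum-exp for stability)."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def identify_speaker(features: Sequence[Sequence[float]],
                     models: Dict[str, GMM]) -> Tuple[str, Dict[str, float]]:
    """Return the speaker whose model gives the highest total log-likelihood."""
    scores = {
        name: sum(gmm_log_likelihood(x, gmm) for x in features)
        for name, gmm in models.items()
    }
    best = max(scores, key=lambda name: scores[name])
    return best, scores
```

Summing log-likelihoods over frames treats the feature vectors as independent, which is the usual simplification in GMM-based speaker identification.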

The feature vector extraction method using variable selection of the overlap length of speech analysis sections, and the speaker recognition system using it, according to the present invention as described above are not limited to the embodiments described. The technical spirit of the invention extends to any range of modifications that a person of ordinary skill in the art to which the invention pertains could make without departing from the gist of the invention as claimed in the following claims.

FIG. 1 is a block diagram of a conventional speaker recognition system that extracts feature vectors using a fixed speech analysis section setting.

FIG. 2 shows an embodiment of a conventional feature vector extraction method that uses a fixed speech analysis section setting.

FIG. 3 is a block diagram of a speaker recognition system using the feature vector extraction method using variable selection of the overlap length of speech analysis sections according to the present invention.

FIG. 4 shows an embodiment of the feature vector extraction method using variable selection of the overlap length of speech analysis sections according to the present invention.

< Description of reference numerals for the main parts of the drawings >

100: feature vector extraction unit    110: speaker model generation unit

120: speaker recognition unit    200: feature vector extraction unit

210: analysis section selection unit    220: overlap length variable unit

230: speaker model generation unit    240: speaker recognition unit

Claims

1. A speaker recognition system using a feature vector extraction method using variable selection of the overlap length of speech analysis sections, comprising:

a feature vector extraction unit 200 that, over the length of an experimental input speech signal or an input speech signal for actual speaker recognition, sets a reference analysis section having a predetermined unit length (L) and M candidate analysis sections obtained by shifting from the reference analysis section in steps of a preset unit interval (r), and extracts the feature vector of each section;

an analysis section selection unit 210 that analyzes the spectral difference between the feature vector of the current reference analysis section extracted by the feature vector extraction unit 200 and the feature vector of each candidate analysis section, and selects the candidate analysis section whose spectral difference from the current reference analysis section is smallest;

an overlap length variable unit 220 that extracts the feature vector of the selected candidate analysis section and, at the same time, sets the length between the selected candidate analysis section and the current reference analysis section as the overlap length and moves the current reference analysis section to the selected candidate analysis section, repeating the process of resetting the new reference analysis section and the new M candidate analysis sections of the feature vector extraction unit 200 until M candidate analysis sections can no longer be set within the length of the input speech signal, thereby variably selecting the overlap length of the speech analysis sections and extracting feature vectors for the experimental input speech signal or the input speech signal for actual speaker recognition;

a speaker model generation unit 230 that creates a speaker model for each speaker using the feature vectors extracted by the overlap length variable unit 220 from the experimental input speech signal; and

a speaker recognition unit 240 that, with the speaker model for each speaker created, when the overlap length variable unit 220 extracts feature vectors from the input speech signal for actual speaker recognition, applies the extracted feature vectors to each speaker model generated by the speaker model generation unit 230 to calculate the probability value of being recognized as that speaker, and determines the speaker of the speaker model with the largest probability value to be the speaker of the input speech for actual speaker recognition.

2. The speaker recognition system of claim 1, wherein the analysis section selection unit 210 uses the equation

[equation not reproduced]

(where j denotes the set of candidate analysis sections, and the distance term is the spectral distance between the feature vector of the reference analysis section and the feature vector of the candidate analysis section)

to analyze the spectral distance between the extracted feature vector of the current reference analysis section and the feature vector of each candidate analysis section, and to select the candidate analysis section whose spectral distance from the current reference analysis section is smallest.

3. The speaker recognition system of claim 2, wherein the analysis section selection unit 210 uses any one of the Euclidean cepstral distance, the Mahalanobis distance, and the weighted cepstral distance as the spectral distance between the feature vector of the reference analysis section and the feature vector of the candidate analysis section.

4. A feature vector extraction method using variable selection of the overlap length of speech analysis sections, comprising:

setting, over the length of an experimental input speech signal or an input speech signal for actual speaker recognition, a reference analysis section having a predetermined unit length (L) and M candidate analysis sections obtained by shifting from the reference analysis section in steps of a preset unit interval (r), and extracting the feature vector of each section (S100, S110, S120);

analyzing the spectral difference between the extracted feature vector of the current reference analysis section and the feature vector of each candidate analysis section, and selecting the candidate analysis section whose spectral difference from the current reference analysis section is smallest (S130); and

extracting the feature vector of the selected candidate analysis section and, at the same time, setting the length between the selected candidate analysis section and the current reference analysis section as the overlap length and moving the current reference analysis section to the selected candidate analysis section, repeating the process of resetting a new reference analysis section and M new candidate analysis sections until M candidate analysis sections can no longer be set within the length of the input speech signal, thereby variably selecting the overlap length of the speech analysis sections and extracting feature vectors for the experimental input speech signal or the input speech signal for actual speaker recognition (S140, S150, S160).

5. The method of claim 4, wherein the candidate analysis section selection step (S130) uses the equation

[equation not reproduced]

(where j denotes the set of candidate analysis sections, and the distance term is the spectral distance between the feature vector of the reference analysis section and the feature vector of the candidate analysis section)

to analyze the spectral distance between the extracted feature vector of the current reference analysis section and the feature vector of each candidate analysis section, and to select the candidate analysis section whose spectral distance from the current reference analysis section is smallest.

6. The method of claim 5, wherein the candidate analysis section selection step (S130) uses any one of the Euclidean cepstral distance, the Mahalanobis distance, and the weighted cepstral distance as the spectral distance between the feature vector of the reference analysis section and the feature vector of the candidate analysis section.