KR20040028790A - Speaker recognition systems

Info

Publication number: KR20040028790A
Application number: KR10-2003-7016649A
Authority: KR
Inventors: Andrew Thomas Sapeluk (앤드류 토마스 사펠록)
Original assignee: Securivox Ltd (세큐리복스 리미티드)
Priority date: 2001-06-19
Filing date: 2002-06-13
Publication date: 2004-04-03
Also published as: GB0114866D0; KR100917419B1; ZA200400292B

Abstract

In a speaker recognition (identification and/or verification) method and system, speech models for enrolled speakers consist of sets of feature vectors representing the smoothed frequency spectrum of each of a plurality of frames. A clustering algorithm is applied to the feature vectors of the frames to obtain a reduced data set representing the original speech sample, and adjacent frames overlap by at least 80%. Speech models of this type model the static components of the speech sample and exhibit temporal independence. An identifier strategy is employed in which the modeling and classification processes are selected to give a false rejection rate substantially equal to zero. Each enrolled speaker is associated with a cohort of a predetermined number of other enrolled speakers, and a test sample is always matched either with the claimed identity or with one member of its associated cohort. The overall error rate of the system therefore depends only on the false acceptance rate, which is determined by the cohort size. The error rate is further reduced by using multiple parallel modeling and/or classification processes. Speech models are normalized prior to classification using a normalization model derived from either the test speech sample or one of the enrolled speaker samples (most preferably from the enrollment sample of the claimed identity).

Description

Speaker recognition system {SPEAKER RECOGNITION SYSTEMS}

Speaker recognition encompasses the related fields of speaker verification and speaker identification. Its main purpose is either to confirm a speaker's claimed identity from his or her utterances (known as "verification") or to recognize the speaker from those utterances (known as "identification"). Both use a person's voice as a biometric measure and assume a unique relationship between an utterance and the person producing it. This unique relationship is what makes both verification and identification possible. Speaker recognition technology analyzes a test utterance and compares it with a known template or model for the person being recognized or verified. The effectiveness of such a system depends on the quality of the algorithms used in the process.

Speaker recognition systems have many possible applications. According to a further aspect of the invention, speaker recognition technology may be used to permanently mark an electronic document with a biometric print of every person who views or edits its content. This produces an audit trail that identifies every user and the number of accesses and modifications. Since the user mark is biometric, it is very difficult for a user to dispute the authenticity of the mark.

Other biometric measures, such as iris scanning, fingerprints and facial features, can also provide a basis for recognition systems. All of these require additional hardware for recording, whereas speaker recognition can be used with any voice input, for example over a telephone line, or with a standard multimedia personal computer without modification. These techniques can be combined with other security measures and with other biometric measures for enhanced security. From the user's point of view, operation of such a system is very simple.

For example, when an online document is requested, the person requesting access is asked to provide a sample of his or her voice. This may be a simple prompt from the client software ("please say this phrase") or something similar. The spoken phrase is then sent over any data network, such as the Internet, to a speaker recognition server or database server, where it is associated with the document and stored as the key used to activate the document at that particular time. Over time, a permanent record is thus built up for the document, providing an audit trail for it. The speaker authentication server may hold a set of templates (models) for all currently enrolled persons and a historical record of previously enrolled persons.

Speaker recognition systems rely on extracting certain unique features from a person's speech. These in turn depend on the way human speech is produced using the vocal tract and the nasal tract. For practical purposes, the vocal tract and nasal tract can be regarded as two connected pipes that resonate in a manner similar to a musical instrument. The resonances produced depend on the diameters and lengths of the pipes. In the human speech production mechanism, the lengths and diameters of the pipe sections can be modified, to some extent, by the articulators (mainly the tongue, jaw, lips and soft palate, or velum). In the spectrum these resonances are called formant frequencies. A typical speech spectrum normally contains about four formant frequencies.

As with a musical instrument, sound is produced only when a constriction of the airflow causes vibration or turbulence. In human speech, the main vibrations occur when the constriction is at the glottis (vocal cords). When this happens, voiced speech is produced, typically vowel-like sounds. When the constriction is in the mouth, caused by the tongue or teeth, turbulence is produced (a hissing type of sound), and the resulting speech is called a fricative, typified by "s", "sh", "th" and the like. From an engineering point of view, this is analogous to a source signal (the result of the constriction) being applied to a filter that has the general characteristics (that is, the same resonances) of the vocal tract, the resulting output signal being the speech sound. Real speech is produced by dynamically varying the positions of the articulators.
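The source-filter view described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rate, pitch period, formant frequencies and bandwidths are hypothetical round numbers, and the names `impulse_train` and `resonator` are not taken from the patent.

```python
import math

def impulse_train(n_samples, period):
    """Crude glottal source: one unit pulse every `period` samples."""
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def resonator(source, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole digital resonator standing in for a single formant."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)   # pole radius (< 1, so stable)
    theta = 2.0 * math.pi * freq_hz / sample_rate          # pole angle
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    out = []
    y1 = y2 = 0.0                                          # previous two outputs
    for x in source:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

sample_rate = 8000                                         # assumed, in Hz
source = impulse_train(800, period=80)                     # 100 Hz pulse train
speech_like = source
for formant_hz in (500, 1500, 2500, 3500):                 # four assumed formants
    speech_like = resonator(speech_like, formant_hz, 100, sample_rate)
```

Changing the formant frequencies between calls would mimic, very roughly, the articulators moving while the source stays the same.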

All existing speaker recognition systems perform similar computational steps. They operate by creating a template or model for an enrolled speaker. The model is created by two main steps applied to a speech sample: spectral analysis and statistical analysis. Subsequent recognition of an input speech sample (the test utterance) is performed by modeling the input sample in the same way as during speaker enrollment and then pattern/classification matching the input model against a database of enrolled speakers. Existing systems differ in the approach taken in performing some or all of these steps. In conventional (industry standard) systems, the spectral analysis is either Linear Predictive Coding (LPC)/cepstral analysis (LPCC) or FFT/sub-banding. This is followed by a statistical analysis technique called Hidden Markov Modeling (HMM), and the classification step is a combination of a match against the claimed speaker model and a match against an "impostor cohort" or "world model" (that is, a set of other speaker models).

To allow efficient processing of speech samples, all speaker recognition systems use time slices called frames: the utterance is divided into frames and each frame is processed in turn. Frames may or may not be of equal size, and may or may not overlap. An example of a typical time signal representation of a speech utterance divided into frames is shown in FIG. 1 of the accompanying drawings. A generic speaker recognition system is shown in block diagram form in FIG. 2. There, a test utterance is processed through an input filter (10), a spectral analysis (LPCC) stage (12) and a statistical analysis (HMM) stage (14), followed by score normalization and speaker classification (16), which applies thresholding and uses a database (18) of speaker models (the enrolled speaker data-set), before a decision is generated as to the identity of the speaker (identification) or the veracity of the speaker's claimed identity (verification).
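As a minimal sketch of the framing step, the following Python function splits a sampled utterance into equal-length, overlapping frames. The function name, the frame length of 200 samples and the toy input are assumptions for illustration; the 80% overlap is the minimum figure given in the abstract.

```python
def split_into_frames(samples, frame_len, overlap):
    """Split `samples` into equal-length frames.

    `overlap` is the fraction of each frame shared with its successor,
    e.g. 0.8 for 80% overlap. A trailing partial frame is dropped.
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    hop = max(1, int(round(frame_len * (1.0 - overlap))))  # step between frame starts
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]

samples = list(range(1000))            # stand-in for digitized speech
frames = split_into_frames(samples, frame_len=200, overlap=0.8)
```

With these numbers the frame start advances by 40 samples each time, so consecutive frames share 160 of their 200 samples.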

Such systems have a number of disadvantages or limitations. First, conventional spectral analysis techniques produce a limited and incomplete feature set and therefore poor modeling. Second, HMM is a "black box" technique that combines good performance with relative ease of use at the expense of transparency. The relative importance of the features extracted by the technique is not visible to the designer. Third, the nature of HMM models does not allow model-against-model comparisons to be performed effectively. Important structural detail contained within the enrolled speaker data-set therefore cannot be analyzed and used effectively to improve system performance. Fourth, HMM uses temporal information to construct the model and is therefore vulnerable to mimics who impersonate another person's voice by temporal variations in pitch and the like. Fifth, the world model/impostor cohort used by the system to test the claimed speaker's utterance cannot easily be optimized.

The performance of a speaker recognition system relies on the fact that when a true speaker's utterance is tested against the model for that speaker, it produces a score lower than the score produced when an impostor's utterance is tested against the same model. This allows an accept/reject threshold to be set. Consecutive tests by the true speaker will not produce identical scores. Instead, the scores form a statistical distribution. However, the mean of the true speaker distribution will be considerably lower than the means of the impostor distributions tested against the same model. This is illustrated in FIG. 3, in which 25 scores are plotted for each of eight speakers, speaker 1 being the true speaker. It can be seen from FIG. 3 that the scores of some speakers are closer to those of the true speaker than others, which can be problematic.
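The role of the accept/reject threshold can be illustrated with a small Python sketch. The scores below are made-up numbers chosen only to show the mechanics, not measurements from the patent, and placing the threshold midway between the two means is an assumed rule rather than the method described here.

```python
from statistics import mean

def error_rates(true_scores, impostor_scores, threshold):
    """Return (false rejection rate, false acceptance rate).

    A test is accepted when its score is at or below the threshold,
    because lower scores indicate a closer match to the model.
    """
    false_rejects = sum(s > threshold for s in true_scores)
    false_accepts = sum(s <= threshold for s in impostor_scores)
    return (false_rejects / len(true_scores),
            false_accepts / len(impostor_scores))

# Hypothetical scores against one speaker's model (illustration only).
true_scores = [1.0, 1.2, 0.9, 1.4, 1.1]
impostor_scores = [2.0, 1.3, 2.4, 1.9, 2.2]

# Assumed rule: place the threshold midway between the two means.
threshold = (mean(true_scores) + mean(impostor_scores)) / 2.0
frr, far = error_rates(true_scores, impostor_scores, threshold)
```

Here one impostor score falls below the threshold, which is the kind of overlap between distributions that the paragraph above calls problematic.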

The present invention relates to systems, methods and apparatus for performing speaker recognition.

FIG. 1 shows a time signal representation of an example of a speech utterance divided into frames;

FIG. 2 is a block diagram of a generic prior art speaker recognition system;

FIG. 3 is a plot of speaker recognition score distributions for a number of speakers tested;

FIG. 4 is a block diagram of a first embodiment of the present invention;

FIG. 5 is a block diagram of a second embodiment of the present invention;

FIG. 6 is a block diagram of a third embodiment of the present invention;

FIG. 7 is a block diagram of a further embodiment of a speaker recognition system according to the present invention;

FIG. 8A shows a time signal representation of an example of a speech utterance divided into frames;

FIG. 8B shows the corresponding frequency spectrum and smoothed frequency spectrum of one of those frames;

FIG. 9 shows the differences between the frequency spectra of two mis-aligned frames;

FIG. 10 shows the distribution of accumulated frame scores plotted against their frequency of occurrence;

FIG. 11A shows the same accumulated score distributions as FIG. 3, for comparison with FIG. 11B, which shows the corresponding accumulated score distributions obtained using a speaker recognition system according to the present invention;

FIG. 12 shows the results of model-against-model comparisons, compared with actual test scores, obtained using a system according to the present invention;

FIG. 13 shows the distribution of the speaker models used by a system according to the present invention in a two-dimensional representation of a multi-dimensional data space;

FIG. 14 illustrates the use of impostor cohorts in a system according to the present invention;

FIG. 15 is a block diagram illustrating a normalization process in accordance with one aspect of the present invention;

FIG. 16 is a block diagram illustrating an example of a wide area user authentication system according to the present invention;

FIG. 17 is a block diagram illustrating the corruption of a speech signal by various noise sources and channel characteristics in the input channel of a speaker recognition system;

FIGS. 18 and 19 illustrate the effects of noise and channel characteristics on test utterances and enrollment models in a speaker recognition system; and

FIG. 20 illustrates a channel normalization method in accordance with one aspect of the present invention.

The present invention relates to improved speaker recognition methods and systems that offer better performance than conventional systems. In its various aspects, the invention provides improvements including, but not limited to: improved spectral analysis; transparency in the statistical analysis; improved modeling; models that can be compared with one another, so that the structure of the data-set can be analyzed and used to improve system performance; improved classification methods; and the use of statistically independent or partially independent parallel processes to improve system performance.

The invention further includes computer programs for implementing the methods and systems of the invention, data carriers and storage media encoded with such programs, data processing apparatus and systems adapted to implement the methods and systems, and data processing systems and apparatus incorporating such apparatus and systems.

The various aspects and preferred features of the invention are defined in the appended claims.

The present invention includes a number of aspects and features that can be combined in various ways to provide improved speaker recognition (verification and/or identification) systems. Certain aspects of the invention concern the manner in which speech samples are modeled during speaker enrollment and during subsequent recognition of input speech samples. Other aspects concern the manner in which input speech models are classified in order to reach a decision about the speaker's identity. A further aspect concerns normalizing the speech signals input to a speaker recognition system (channel normalization). Still further aspects concern applications of speaker recognition systems.

Referring now to the drawings, FIGS. 4 to 6 illustrate the basic architectures used in systems embodying various aspects of the invention. It will be understood that the inputs to all of the embodiments described here are digital signals comprising speech samples that have previously been digitized by any suitable means (not shown), and that all of the filters and other modules referred to are also digital.

In FIG. 4, a speech sample is input to the system via a channel normalization module (200) and a filter (24). Instead of or in addition to this "front-end" normalization, channel normalization can be performed at a later stage in processing the speech sample, as discussed below. The sample is divided into a series of frames before being input to the filter (24) or at some other point prior to feature extraction. In some embodiments, as described below, a noise signal (206) may be added to the filtered signal (or may be added before the filter (24)). The sample data are input to a modeling (feature extraction) module (202), which includes a spectral analysis module (26) and (at least in the case of speech samples processed for enrollment) a statistical analysis module (28). The model (feature set) output from the modeling module (202) comprises a set of coefficients representing the smoothed frequency spectrum of the input speech sample. During enrollment of a speaker, the model is added to a database of enrolled speakers (not shown). During recognition of an input speech sample, the model (feature set) is input to a classification module (110), which compares it with models selected from the database of enrolled speakers. On the basis of this comparison, a decision is reached at (204) so as to identify the speaker or to verify the speaker's claimed identity. Channel normalization of the input sample and the addition of the noise signal (206) constitute aspects of the invention, as described in more detail below, and are preferred features of all implementations of the invention. In some embodiments, channel normalization may be applied after spectral analysis (26) or during classification, rather than being applied to the input speech sample prior to processing as shown in FIGS. 4 to 6. Novel aspects of the modeling and classification processes in accordance with other aspects of the invention are also described in more detail below.

Other aspects of the invention include various forms of parallelism in the processing of speech samples for enrollment and/or recognition.

Referring to FIG. 5, the basic operation of the system is the same as that of FIG. 4, except that the output from the modeling module (202) is input to multiple, parallel classification processes (110a, 110b, ..., 110n), and the outputs of the multiple classification processes are combined to reach a final decision, as described in more detail below. In FIG. 6, the basic operation of the system is again the same as that of FIG. 4, except that the input sample is processed by multiple, parallel modeling processes (202a, 202b, ..., 202n), which typically provide slightly different feature extraction/modeling as described below. This processing may be via multiple filters (24a, 24b, ..., 24n), in which case the noise signal (206) is shown being added to the input signal upstream of the filters (24a-n). The outputs of the multiple modeling processes are input to the classification module (110), as also described in more detail below. Such multiple parallel modeling processes are preferably applied both to enrollment sample data and to test sample data.

Multiple parallel modeling processes may also be combined with multiple parallel classification processes. For example, the input to each of the parallel classification processes (110a-n) of FIG. 5 could be the output from multiple parallel modeling processes as shown in FIG. 6.

Various aspects of the invention are described in more detail below with reference to the modeling, classification and normalization processes shown in FIGS. 4 to 6.

Modeling

The spectral analysis modules (26, 26a-n) may apply spectral analysis methods similar to those used in conventional speaker recognition systems. Preferably, the spectral analysis applied by the modules (26a-n) is of a type that, for each frame of the sample data, extracts a set of feature vectors (coefficients) representing the smoothed frequency spectrum of that frame. This preferably comprises LPC/cepstral (LPCC) modeling, producing an increased feature set that models the finer detail of the spectrum, but may include variants such as delta cepstral or emphasis/de-emphasis of selected coefficients based on a weighting scheme. Similar coefficients may alternatively be obtained by other means, such as a Fast Fourier Transform (FFT) or a filter bank.

The complete sample is represented by a matrix consisting of one row of coefficients for each frame of the sample. In the preferred embodiments of the invention, these matrices will each have a size of the order of 1000 (frames) x 24 (coefficients). In conventional systems, a single first matrix of this type, representing the complete original signal, would be subjected to statistical analysis such as HMM.

As will be appreciated by those skilled in the art, the LP transform effectively produces a set of filter coefficients representing the smoothed frequency spectrum for each frame of the test utterance. The LP filter coefficients are related to Z-plane poles. The cepstral transform has the effect of compressing the dynamic range of the smoothed spectrum, de-emphasizing the LP poles by moving them closer to the Z-plane origin (away from the real frequency axis at z = e^{jω}). The cepstral transform uses a log function for this purpose. It will be understood that other similar or equivalent techniques could be used in the spectral analysis of the speech sample to obtain a smoothed frequency spectrum and to de-emphasize its poles. This de-emphasis produces a set of coefficients which, when transformed back into the time domain, are less dynamic and better balanced (the cepstral coefficients are akin to a time signal, or the impulse response, of the LP filter with de-emphasized poles). The log function also transforms multiplicative processes into additive processes.
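One common way to obtain cepstral coefficients from LP coefficients is a textbook recursion derived from the log of the all-pole transfer function. The sketch below shows that recursion; it is not necessarily the computation used in the patent. It assumes the sign convention H(z) = 1 / (1 - Σ a_k z^-k) and ignores the gain term, and the function name is hypothetical.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert LP predictor coefficients to cepstral coefficients.

    `a[k-1]` holds a_k for the assumed all-pole model
    H(z) = 1 / (1 - sum_k a_k z^-k). Returns [c_1, ..., c_n_ceps].
    """
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        # Recursion: c_n = a_n + sum_{k} (k/n) * c_k * a_{n-k}
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c

# Single pole at z = 0.5: the log of 1 / (1 - 0.5 z^-1) has
# series coefficients 0.5**n / n, which the recursion should reproduce.
cepstrum = lpc_to_cepstrum([0.5], n_ceps=4)
```

The 1/n factor in the single-pole case shows the compression described above: higher-order terms shrink faster than the corresponding powers of the pole.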

A model derived from a speech sample can be regarded as a set of feature vectors based on the frequency content of the sample signal. When feature vectors based on frequency content are extracted from a signal, the order of the vector is important. If the order is too low, some important information may not be modeled. To avoid this, the order of the feature extractor (for example, the number of poles of the LP filter) may be selected to be greater than the expected order. However, this in itself causes problems. Poles that match resonances in the signal give good results, whereas the other resulting coefficients of the feature vector will model spurious aspects of the signal. Accordingly, when this vector is compared with another model or reference, the distance measure computed may be unduly influenced by the values of the coefficients that are modeling spurious aspects of the signal. The distance measure (score) returned will therefore be inaccurate, possibly giving a poor score for a frame that is in fact a good match.
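The effect of spurious coefficients on a distance measure can be seen in a toy Python example. The Euclidean metric, the function name and the vectors are illustrative assumptions; this passage does not say which distance measure is used.

```python
import math

def euclidean_distance(u, v):
    """Distance between two equal-length feature vectors (assumed metric)."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same order")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical vectors: the first two coefficients model real resonances,
# the last two are "extra" coefficients modeling spurious signal detail.
reference = [1.0, 0.5, 0.0, 0.0]
test_frame = [1.0, 0.5, 0.3, -0.4]

matched_part = euclidean_distance(reference[:2], test_frame[:2])
full_vector = euclidean_distance(reference, test_frame)
```

The coefficients that carry the real resonances match exactly, yet the full-vector distance is non-zero purely because of the extra coefficients.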

According to one aspect of the invention, this problem can be prevented or mitigated by adding a noise signal n(t) (206 in FIGS. 4 to 6) with known characteristics to the speech signal s(t) before the signal is input to the modeling process (that is, input signal = s(t) + n(t)). The same noise signal may be used during speaker enrollment and in subsequent use of the system. Adding the known noise signal forces the "extra" coefficients (those beyond the number actually required) to model a known function, and thus gives consistent results that are less problematic during model/test vector comparison. This is particularly relevant to suppressing the effects of noise (channel noise and other noise) during "silences" in the speech sample data. The problem may also be addressed as a consequence of the use of massively overlapped sample frames, described below.
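The idea of adding the same known noise signal at enrollment and at test time can be sketched as follows. This is only an illustration: the text does not specify the noise, so the seeded white Gaussian noise, the function names, and the `seed` and `level` parameters are all assumptions.

```python
import numpy as np

def make_known_noise(num_samples, seed=1234, level=0.01):
    """Return a reproducible noise signal n(t).

    The seed and level are illustrative assumptions; the point is only
    that the same n(t) can be regenerated at enrollment and at test time.
    """
    rng = np.random.default_rng(seed)
    return level * rng.standard_normal(num_samples)

def add_known_noise(speech, seed=1234, level=0.01):
    """Form the modeling input s(t) + n(t) from a speech signal s(t)."""
    speech = np.asarray(speech, dtype=float)
    return speech + make_known_noise(len(speech), seed, level)
```

Because the generator is seeded, two calls on signals of equal length add an identical n(t), which is the property the passage relies on.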

As mentioned above, to enable speech samples to be processed effectively, all speaker recognition systems use time slices called frames: the utterance is divided into a sequence of frames, and each frame is processed in turn. Frames may or may not be of equal size, and may overlap. The models generated by a speaker recognition system therefore comprise a plurality of feature sets (vectors corresponding to sets of coefficients) representing a plurality of frames. When models are compared in conventional speaker recognition systems, the corresponding frames of each model must be aligned. Different utterances of a given phrase will never have exactly the same length, even when spoken by the same person, so it is difficult to align the frames correctly for comparison.

Conventional systems convert the frames into a spectral or smoothed spectral equivalent, as shown in FIG. 8A (a time signal divided into frames) and FIG. 8B (the corresponding frequency spectrum and smoothed frequency spectrum of one of the frames of FIG. 8A). The systems then perform further transformations and analysis (such as the cepstral transform, Vector Quantization, Hidden Markov Modeling (HMM) and Dynamic Time Warping (DTW)) to obtain the desired result. Frame boundaries can be assigned in many ways, but are usually measured from an arbitrary starting point estimated to be the start of the useful speech signal. To compensate for this arbitrary starting point, and for natural variation in the length of similar sounds, techniques such as HMM and DTW are used when comparing two or more utterances, as when building models or when comparing test utterances with models. The HMM/DTW compensation is generally done at a point in the system following spectral analysis, using whatever coefficients are used to represent the content of the frames, and does not refer to the original time signal. The alignment precision is thus limited to the size of a frame. In addition, these techniques assume that the alignment of a particular frame will lie within a fixed region of the utterance, within a few frames of where it is expected to be. This introduces a temporal element into the system, since the estimated alignment of the current frame depends on the alignment of previous frames, and the alignment of subsequent frames depends on the alignment of the current frame. In practice, this means that a particular frame, such as a frame occurring 200 ms into an utterance, will generally be compared only with other frames in the 200 ms region of the model or of the other utterances used to build the model. This approach derives from speech recognition methods (for example, speech-to-text conversion), where it is used to estimate a phonetic sequence from a series of frames. The applicants consider this approach inappropriate for speaker recognition, for the following reasons.

A) Most seriously, the conventional approach provides only a crude alignment of frames. The arbitrary assignment of starting points means that it is generally not possible to obtain an exact alignment of the starting points of two respective frames, so that even two frames giving a "best match" may have significantly different spectral characteristics, as illustrated in FIG. 9.

B) Secondly, the conventional approach relies on the temporal sequence of the frames and bases speaker verification on spectral characteristics derived from temporally adjacent frames.

According to a further aspect of the invention, to overcome the problems arising from frame alignment between models (described in (A) above) and to improve the quality of the resulting model, the enrollment modeling process of the invention uses a very large frame overlap, akin to convolution. This technique is applied during speaker enrollment to obtain a model (preferably based on repeated utterances of an enrollment phrase). By massively overlapping the frames, the resulting model effectively approaches a model of all possible alignments, with relatively small differences between adjacent frames, and thereby provides good modeling of patterns. The frame overlap is preferably selected to be at least 80%, more preferably in the range 80% to 90%, and may be as high as 95%.
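A minimal sketch of massively overlapped framing, assuming fixed-length frames and a hop size derived from the overlap fraction. The function name `frame_signal` and its parameters are illustrative and do not come from the text.

```python
import numpy as np

def frame_signal(signal, frame_len, overlap=0.9):
    """Split a 1-D signal into fixed-length frames with fractional overlap.

    overlap=0.9 means each frame shares 90% of its samples with the
    previous one, so the hop between frame starts is 10% of frame_len.
    """
    signal = np.asarray(signal, dtype=float)
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    starts = np.arange(0, len(signal) - frame_len + 1, hop)
    return np.stack([signal[s:s + frame_len] for s in starts])
```

With a 1000-sample signal and 100-sample frames, a 90% overlap yields 91 frames where non-overlapping framing yields 10, which illustrates how the overlap multiplies the number of frame alignments represented in the model.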

The frames are transformed into representative coefficients using the LPCC transform described above, so that each utterance contributing to the reference model generated by the enrollment process is represented by a matrix (typically of the order of 1000 frames by 24 coefficients, as described above). Typically there might be ten such matrices, representing ten utterances. A clustering or averaging technique such as Vector Quantization (described further below) is then used to reduce the data and produce the speaker's reference model. This model does not depend on the temporal order of the frames, which addresses the problem noted in (B) above.

Preferred embodiments of the invention combine the massive frame overlap described above with Vector Quantization or the like, as described below. This provides a mode of operation quite different from that of conventional HMM/DTW systems. In such conventional systems, all frames are considered equally valid, and are generally used to derive a final "score" for thresholding into a yes/no decision by accumulating the scores obtained from comparing and matching individual frames. The validity of the scores so obtained is limited by the accuracy of the frame alignments.

According to this aspect of the invention, the reference (enrollment) models represent a large number of possible frame alignments. Rather than matching individual frames of the test utterance with individual frames of the reference model and deriving a score for each matched pair, every frame of the test utterance is compared with, and scored against, every frame of the reference model, giving a statistical distribution of the frequency of occurrence of frame score values. "Good" frame matches yield low scores and "bad" frame matches yield high scores (or vice versa, depending on the scoring scheme). A test utterance frame tested against a large number of reference models will result in a normal distribution, as illustrated in FIG. 10. Most frame scores lie close to the mean and within a few standard deviations of it. Because of the massive overlap of frames in the reference models, the score distributions will include "best matches" between accurately aligned corresponding frames of the test utterance and the reference models. Thus, when a test utterance from a particular speaker is tested against the reference model for that speaker, the distribution will contain a higher incidence of very low scores. Ultimately, this results in "true speaker" scores being consistently low, because some parts of the utterance are readily identified as coming from the true speaker, while other parts that are less obviously from the true speaker are classified as coming from the general population. Impostor frame scores will not produce low scores and will be classified as coming from the general population.

That is, according to this aspect of the invention, the reference models comprise sets of coefficients derived for a plurality of massively overlapped frames, and a test utterance is tested by comparing all of its frames with all of the frames of the relevant reference models and analyzing the distribution of the frame scores so obtained.
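The all-frames-against-all-frames comparison can be sketched as below. The text does not fix the scoring scheme, so the Euclidean distance, the histogram as the "distribution", and the name `frame_score_distribution` are assumptions made for illustration.

```python
import numpy as np

def frame_score_distribution(test_frames, model_frames, bins=20):
    """Score every test frame against every model frame.

    Each row is one frame's coefficient vector. Returns the full score
    matrix (test x model) plus a histogram of the scores; a low score
    means a close match under this (assumed) Euclidean scoring.
    """
    test = np.asarray(test_frames, dtype=float)
    model = np.asarray(model_frames, dtype=float)
    # Broadcast to get all pairwise differences, shape (n_test, n_model, dim).
    diffs = test[:, None, :] - model[None, :, :]
    scores = np.sqrt((diffs ** 2).sum(axis=2))
    counts, edges = np.histogram(scores, bins=bins)
    return scores, counts, edges
```

Under this sketch, a true-speaker test would be expected to put more mass in the lowest histogram bins than an impostor test, which is the property the classification relies on.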

The massive frame overlap applied to the speech samples for enrollment may also be applied to input utterances during subsequent speaker recognition, although this is not essential.

The use of massive overlaps in the enrollment sample data is also beneficial in dealing with problems arising from noise present in periods of silence in the sample data. Such problems are particularly significant for text-independent speaker recognition systems. The presence of silences may or may not cause problems for individual models or verification attempts, but it will degrade overall system performance. The question is therefore how to remove silences entirely or minimize their adverse effects. The use of massive frame overlaps according to the invention provides an inherent solution. Consider the following equation, which describes the averaging of frame spectra (discussed in greater detail below).

From this equation it can be seen that the average of the static parts (ss) is ss(ω), and that each frame has the spectrum ss_n(ω) × sd_n(ω). Consider the spectrum of two frames added together: (ss₁(ω) × sd₁(ω)) + (ss₂(ω) × sd₂(ω)) = ss(ω) × (sd₁(ω) + sd₂(ω)). Here the steady part is multiplied by a new spectrum, sd₁(ω) + sd₂(ω). However, since the dynamic parts are to be reduced by averaging and are inherently dynamic or variable, the new spectrum should behave in exactly the same way as a randomly extracted frame. This implies that frames can be added together at random with minimal effect on performance. That observation is not entirely true, since it is possible to have valid speech frames added to silence frames, in which case the net result is a valid speech frame. In fact this results in an improvement in performance, as unwanted silences are no longer included in the modeling.

If a typical signal with some minor silence problems had its time frames added together at random, the silences would be eliminated but the signal would appear to have undergone major corruption. The invention, using massively overlapped frames, nevertheless still functions. Interestingly, this implies that channel echoes have no effect and can be ignored. It also underlines the fact that the preferred modes of operation of the invention extract the static parts of the spectrum to a greater extent than conventional verifiers (as discussed further below). Adding frames in this way has substantially the same effect as adding colored noise to prevent unwanted modeling, as described above.

According to another aspect of the invention, clustering or averaging techniques such as Vector Quantization, applied by the modules 28, 28a-n, are used in a manner that differs from the statistical analysis techniques used in conventional speaker recognition systems.

Preferably, the system of the invention uses a Vector Quantization (VQ) technique to process the enrollment sample data output from the spectral analysis modules 26, 26a-n. This is a simpler technique than statistical analysis techniques such as HMM used in many conventional systems, and results in transparent modeling that provides models in a form allowing model-against-model comparison in the subsequent classification stage. In addition, VQ as used in the invention does not use temporal information, which makes the system resistant to impostors.

The VQ process effectively compresses the LPCC output data by identifying clusters of data points, determining average values for each cluster, and discarding data that do not clearly belong to any cluster. This results in a second set of matrices of second coefficients, representing the LPCC data of the first set of matrices but of reduced size (typically, for example, 64 x 24 as compared with 100 x 24).
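The data reduction performed by VQ can be illustrated with a plain k-means codebook. This is a sketch under stated assumptions, not the process used in the system: the function name, the evenly spaced initialization, and the fixed iteration count are illustrative choices, and the step of discarding vectors that do not clearly belong to any cluster is omitted.

```python
import numpy as np

def vq_codebook(vectors, num_codes, iterations=20):
    """Reduce a (frames x coefficients) matrix to num_codes centroid rows."""
    data = np.asarray(vectors, dtype=float)
    # Simple deterministic start: evenly spaced rows as the initial codebook.
    start = np.linspace(0, len(data) - 1, num_codes).astype(int)
    codebook = data[start].copy()
    for _ in range(iterations):
        # Squared Euclidean distance from every vector to every codebook entry.
        dists = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        nearest = dists.argmin(axis=1)
        for k in range(num_codes):
            members = data[nearest == k]
            if len(members) > 0:
                # Replace each entry by the mean of the vectors assigned to it.
                codebook[k] = members.mean(axis=0)
    return codebook
```

Called with a matrix of LPCC frame vectors and `num_codes=64`, this would return a 64-row matrix with the same number of columns, which is the kind of size reduction described above. The order of the input rows plays no part in the result beyond initialization, reflecting the absence of temporal information.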

The effects of using LPCC spectral analysis and clustering/averaging in the invention are discussed below.

The basic model assumes that the spectral magnitude is useful and that the phase can be ignored. This is known to apply to human hearing, and if it were not applied to the verifier the system would exhibit undesirable phase-related problems, such as sensitivity to the distance of the microphone from the speaker. It is further assumed that the spectral information of a speech sample can be regarded as consisting of two parts, a static part ss(ω) and a dynamic part sd(ω), and that the processes are multiplicative. It is also assumed that the dynamic part is significantly larger than the static part:

$$s(\omega) = ss(\omega) \times sd(\omega)$$

Since the static part is by definition fixed, it is the more useful as a biometric, being related to the static characteristics of the vocal tract. This relates the measure to a fixed physical characteristic, as opposed to sd(ω), which is related to the dynamics of the speech.

Complete extraction of ss(ω) would give a biometric exhibiting the properties of a physical biometric, that is, one that cannot be changed at will and does not deteriorate over time. Conversely, exclusive use of sd(ω) would give a biometric exhibiting the properties of a behavioral biometric, that is, one that can be changed at will and will deteriorate over time. A mixture of the two should exhibit intermediate properties, but because sd(ω) is much larger than ss(ω), the combination is more likely to exhibit the properties of sd(ω), that is, to be behavioral.

As with all frequency representations of a signal, the assumption is that the time signal exists from -∞ to +∞, which is clearly not physically possible. In practice, all spectral estimates of a signal are made using a window, which exists for a finite period of time. The window may be rectangular or shaped by a function (such as a Hamming window).

Using a rectangular window amounts to simply taking a section of the signal in the region of interest and assuming that it is zero elsewhere. This technique is common in speech processing, where the sections of signal are called frames. FIG. 1 shows a time signal with the frames indicated.

The frames can be shaped using an alternative window. Interestingly, the main effect of windowing is a kind of spectral averaging, in which the characteristic at a particular frequency is spread to its neighboring frequencies. This effect is caused by the main lobe; in addition, the side lobes produce periodic spectral oscillations in the spectrum. Since the system of the invention subsequently extracts all-pole Linear Prediction coefficients, which have the intended effect of spectral smoothing, the extra smoothing caused by windowing is not seen as a major issue. However, the periodic side-lobe effects could become troublesome if the window size were changed inadvertently. This can be avoided by good housekeeping.
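The main-lobe and side-lobe behavior described above can be examined numerically. The following sketch, whose function name, window length and zero-padding factor are illustrative assumptions, estimates the highest side-lobe level of a window from its zero-padded spectrum.

```python
import numpy as np

def peak_sidelobe_db(window, pad=16):
    """Return the highest side-lobe level in dB relative to the main-lobe peak."""
    window = np.asarray(window, dtype=float)
    # Zero-pad so that the lobes of the window spectrum are finely sampled.
    spectrum = np.abs(np.fft.rfft(window, len(window) * pad))
    spectrum /= spectrum[0]  # normalize to the main-lobe peak at zero frequency
    # Walk down the main lobe until the magnitude stops decreasing (first null).
    k = 0
    while k + 1 < len(spectrum) and spectrum[k + 1] < spectrum[k]:
        k += 1
    # Everything from the first null onward belongs to the side lobes.
    return 20.0 * np.log10(spectrum[k:].max())

rectangular_level = peak_sidelobe_db(np.ones(64))
hamming_level = peak_sidelobe_db(np.hamming(64))
```

In this sketch the rectangular window shows markedly higher side lobes than the Hamming window, at the cost of a narrower main lobe, which is the usual trade-off between spectral spreading and side-lobe oscillation.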

Given that the time signal can be divided into frames, the spectral characteristics of frames 1 to N can be expressed as follows.

$$
\begin{aligned}
s_1(\omega) &= ss_1(\omega) \times sd_1(\omega) \\
s_2(\omega) &= ss_2(\omega) \times sd_2(\omega) \\
&\;\;\vdots \\
s_n(\omega) &= ss_n(\omega) \times sd_n(\omega) \\
&\;\;\vdots \\
s_N(\omega) &= ss_N(\omega) \times sd_N(\omega)
\end{aligned}
$$

However, by definition:

$$ss(\omega) = ss_1(\omega) = ss_2(\omega) = ss_3(\omega) = \dots = ss_N(\omega)$$

On first consideration, it would appear that ss(ω) could be extracted using an averaging process of the following form.
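A form of the averaging that is consistent with the frame definitions above and with the quantity U(ω) discussed in the surrounding text can be written as below. This is a hedged reconstruction: the 1/N normalization and the identification of U(ω) with the averaged dynamic parts are assumptions.

```latex
\frac{1}{N}\sum_{n=1}^{N} s_n(\omega)
  = \frac{1}{N}\sum_{n=1}^{N} ss_n(\omega)\times sd_n(\omega)
  = ss(\omega)\times\frac{1}{N}\sum_{n=1}^{N} sd_n(\omega)
  = ss(\omega)\times U(\omega)
```

The second equality uses the definition that every ss_n(ω) equals ss(ω), so the static part factors out of the sum.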

If the frames had independent spectral characteristics (each resulting from a random process), U(ω) would tend towards white noise. In that case U(ω) would have a flat spectrum, and ss(ω) could be extracted by smoothing the spectrum. This is most likely to be the case when N is very large (N → ∞). Given the linear nature of the time domain - frequency domain - time domain transformations, a similar analysis could have been described in the time domain.

In real-world conditions, it cannot be assumed that N will be large in the sense that the frames have independent spectral characteristics. It is important to remember that N would need to be large under two conditions:

1. During model creation

2. During a verification event

Failure to meet either condition could potentially cause a system failure (error). Failure under condition 1 is the more serious, because it remains a potential source of error until the model is updated, whereas a problem under condition 2 is a single-instance event.

If U(ω) cannot be guaranteed to converge to white noise, what can be done to cope with the situation?

First, consider the following:

1. U(ω) will be a variable quantity.

2. When smoothed across the frequency spectrum it would ideally be flat, that is, the smoothed version Usm(ω) = 1.

3. U(ω) is the truncated sum of the speech frames, the number of which ideally tends to infinity.

Consider the following equation.

Returning to the following frame-based equivalent, the summation part, which tends to a flat spectrum, is not an ideal performance measure.

Taking the logarithm of the frames gives the following:
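A form consistent with the multiplicative frame model and with the symbols lss and lsd used in the surrounding text would be the following. Reading lss(ω) as log ss(ω) and lsd_n(ω) as log sd_n(ω) is an assumption about the notation.

```latex
\log s_n(\omega)
  = \log\bigl(ss(\omega)\times sd_n(\omega)\bigr)
  = \log ss(\omega) + \log sd_n(\omega)
  = lss(\omega) + lsd_n(\omega)
```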

From the above equation it can be seen that the relationship between the static and dynamic parts is now additive. Because the relationship between the time domain and the frequency domain is linear, a transformation from frequency to time gives:

$$lss(\omega) + lsd(\omega) \rightarrow cs(\tau) + cd(\tau) = c(\tau)$$

In signal processing, c(τ) is known as the cepstral transform of s(t), as discussed above.

In general, cepstral analysis consists of time_domain -> frequency_domain -> log(spectrum) -> time_domain. The cepstral transform is used in many forms in speech analysis.
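The chain time_domain -> frequency_domain -> log(spectrum) -> time_domain can be sketched with an FFT-based real cepstrum. Note that this is an illustration of the chain only: the system described here derives cepstral coefficients from LP coefficients (LPCC), and the function name and the small `eps` guard are assumptions.

```python
import numpy as np

def real_cepstrum(frame, eps=1e-12):
    """Compute the real cepstrum of one frame.

    Steps: FFT (time -> frequency), log of the magnitude spectrum
    (phase is ignored), inverse FFT (back to a time-like domain).
    """
    spectrum = np.fft.fft(np.asarray(frame, dtype=float))
    log_magnitude = np.log(np.abs(spectrum) + eps)  # eps avoids log(0)
    return np.fft.ifft(log_magnitude).real
```

Because the logarithm turns products into sums, the cepstrum of two circularly convolved signals equals the sum of their individual cepstra, which is the additive relationship between the static and dynamic parts that the text relies on.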

As discussed above, in the present use, cepstral coefficients are generated for the frames and the static part is extracted as follows.
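One reading of this step, assuming that the static part is estimated as the mean of the per-frame cepstral vectors so that the dynamic part averages towards zero, can be sketched as follows. The function name is illustrative.

```python
import numpy as np

def static_cepstrum_estimate(frame_cepstra):
    """Estimate cs as the mean over frames of c_n = cs + cd_n.

    frame_cepstra has one row per frame and one column per cepstral
    coefficient. If the dynamic parts cd_n average to zero, the mean
    of the rows approaches the static part cs.
    """
    cepstra = np.asarray(frame_cepstra, dtype=float)
    return cepstra.mean(axis=0)
```

With a finite number of frames the dynamic parts do not cancel exactly, so the estimate carries a residual error; this residual is the kind of model error discussed later in the text.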

Ideally, the speech signal would be long enough for the dynamic part to be completely random, so that its mean would tend to zero. This would leave the static part cs(τ) as the biometric measure. However, there are a number of problems to be overcome:

1. How to handle the imperfect nature of the sum-to-zero

2. Channel variation

3. Endpointing

4. Additive noise

Regarding the imperfect nature of the sum-to-zero: by their nature, cepstral coefficients decay with increasing time and have the appearance of the impulse response of a stable system. This means that the dynamic range of each coefficient is different, and the coefficients are generally in descending order.

It can be shown that the differences between the average coefficients of a test sample and the frame coefficient values of the true speaker model and of an impostor model are not large, and that a simple summation over all frames of the utterance to produce a distance score would be difficult to threshold in the conventional manner.

Considering the two difficulties associated with this approach together, rather than separately, reveals the answer to the problem. To re-emphasize, the two difficulties are:

1. The utterances will never be long enough for the mean of the dynamic part to converge to zero.

2. The differences between the true speaker and impostors will be small and difficult to threshold.

Consider two speakers, with models based on the following equation.

The models are then m1(τ) and m2(τ), where m1(τ) is as follows.

Here, e1(τ) is the error.

In vector form the models are as follows.

Expressed in the same form, a test utterance from speaker 1 is as follows.

Using a simple distance measure, the true speaker distance is as follows.

The impostor distance is as follows.

Assuming that the convergence of the dynamic parts of the models is good (that is, that the error vectors are small compared with the static vectors), then in general d1 < d2. This simply states that the models constructed represent the enrolled speakers (a condition that can easily be checked during enrollment using the data available at that time). Interestingly, if e1 and e2 are small compared with the test signal error Te1, the distances are independent of e1 and e2. The condition in which the test error is large compared with the model error arises during text-independent testing. This shows that, if the dynamic components of the enrollment speech samples are minimized in the enrollment models, such models can provide a good basis for text-independent speaker recognition.
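The reasoning above can be made concrete with a hedged reconstruction of the vector relations. The symbols m1, m2, e1, e2, Te1, d1 and d2 come from the text; the static vectors written here as cs_1 and cs_2, the test vector T_1, and the use of a vector norm as the "simple distance measure" are assumptions.

```latex
\begin{aligned}
\mathbf{m}_1 &= \mathbf{cs}_1 + \mathbf{e}_1, \qquad
\mathbf{m}_2 = \mathbf{cs}_2 + \mathbf{e}_2, \qquad
\mathbf{T}_1 = \mathbf{cs}_1 + \mathbf{Te}_1 \\
d_1 &= \lVert \mathbf{m}_1 - \mathbf{T}_1 \rVert
     = \lVert \mathbf{e}_1 - \mathbf{Te}_1 \rVert \\
d_2 &= \lVert \mathbf{m}_2 - \mathbf{T}_1 \rVert
     = \lVert \mathbf{cs}_2 - \mathbf{cs}_1 + \mathbf{e}_2 - \mathbf{Te}_1 \rVert
\end{aligned}
```

Under these assumptions, when e1 and e2 are negligible compared with Te1, the distances reduce to d1 ≈ ‖Te1‖ and d2 ≈ ‖cs_2 − cs_1 − Te1‖, neither of which depends on the model errors, in line with the statement in the text.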

The errors e1 and e2 above are average model construction errors; the actual errors occur frame by frame and are distributed about the mean. This distribution could be modeled in several ways, the simplest being to use a standard clustering technique such as k-means. K-means clustering is also known in other forms as vector quantization (VQ), and it is a major part of the self-organizing map (SOM), also known as the Kohonen artificial neural network.

The system just described, in which a test utterance is applied to two models and the closest is chosen, is a variant of identification. In this case, if either speaker 1 or speaker 2, as registered speakers, claims to be themselves and is tested, they will always test as true, so the false rejection rate (FRR) = 0. If an unknown speaker claims to be speaker 1 or speaker 2, that speaker will be classified as one or the other, so there is a 1/2 chance of success and hence the false acceptance rate (FAR) = 50%. If equal numbers of true speaker tests and random impostor tests are carried out, the overall error rate can be calculated as (FRR + FAR)/2 = (0 + 0.5)/2 = 25%.

It is clear that the number of models against which the test utterance is tested (the cohort) affects the FAR, and that the FAR will fall as the cohort grows. Under these conditions, with FRR = 0, the recognition accuracy is asymptotic to 100% as the cohort size increases; in more general terms, given that the accuracy is as follows, it is asymptotic to 100 - FRR.

It is worth noting at this point that the FRR and FAR are largely decoupled: the FRR is fixed by the quality of the model produced and the FAR is fixed by the cohort size. Note also that halving the error rate requires doubling the cohort size: for example, a cohort of 50 gives 99% accuracy, a cohort of 100 gives 99.5% and a cohort of 200 gives 99.75%. As the cohort grows the computational load increases, and in fact doubles for each halving of the error rate. If the cohort grows to very large numbers, the decoupling of FRR and FAR will break down and the FRR will begin to increase.
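The relationship between cohort size and accuracy can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming that the FAR equals 1/cohort_size (a random impostor is equally likely to be matched to any model tested) and that the overall error rate is the average of FRR and FAR, as in the two-speaker example. The function name `overall_error_rate` is illustrative only and does not appear in the specification.

```python
def overall_error_rate(cohort_size: int, frr: float = 0.0) -> float:
    """Average of FRR and FAR (both as fractions), assuming FAR = 1 / cohort_size."""
    if cohort_size < 1:
        raise ValueError("cohort_size must be at least 1")
    far = 1.0 / cohort_size  # assumption: impostor lands on any model with equal chance
    return (frr + far) / 2.0


# Accuracy (in percent) for the cohort sizes quoted in the text.
for n in (2, 50, 100, 200):
    accuracy = 100.0 * (1.0 - overall_error_rate(n))
    print(f"cohort {n:>3}: accuracy {accuracy:.2f}%")
```

Doubling `cohort_size` halves the returned error rate, which is why each halving of the error also doubles the number of model comparisons.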

Rather than continually increasing the cohort size, another approach is needed to reduce the FAR. According to one aspect of the invention, that approach is to use parallel processes (also discussed elsewhere in this description) which exhibit slightly different impostor characteristics and are therefore partially statistically independent with respect to the identifier strategy. The idea is to take a core identifier that exhibits zero or near-zero FRR and whose FAR is set by the cohort size. The front-end processing of this core identifier is then modified slightly so as to reorder the distances of the cohort member models from the true speaker model. This is done while maintaining FRR ~ 0, and can be achieved by altering the spectral shaping filters 24a-24n (see FIG. 7) or by changing the transformed coefficients, for example by using delta-ceps.

When a registered speaker uses the system, the test signal is applied to all of the processes in parallel; each process has FRR ~ 0, so the speaker will pass each of them. When an unknown impostor uses the system, the impostor will pass each individual process with a probability of 1/cohort_size. However, the parallel processes introduce conditional probabilities: given that the impostor has passed process 1, what is the probability of also passing the modified process 2, and so on. Although the probability of an impostor passing all of the processes is not that of the statistically independent case (statistically_independent_result = process_prob^no_of_processes), it still falls as processes are added. For a given per-process FAR, the overall accuracy of the system can therefore be seen to increase with the number of processes.
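For orientation, the statistically independent case named in the text can be evaluated directly. This is a minimal sketch of that idealized case only: the specification states that the real processes are only partially independent, so the value computed here is not the actual combined FAR. The function name `independent_far` is illustrative.

```python
def independent_far(cohort_size: int, n_processes: int) -> float:
    """FAR if every parallel process were fully independent: (1 / cohort_size) ** n_processes."""
    if cohort_size < 1 or n_processes < 1:
        raise ValueError("cohort_size and n_processes must be at least 1")
    process_prob = 1.0 / cohort_size  # chance of an impostor passing one process
    return process_prob ** n_processes


# With a cohort of 20, one process gives 5%; two fully independent processes would give 0.25%.
print(independent_far(20, 1), independent_far(20, 2))
```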

Where multiple parallel processes are used in this way, the scheme for matching a test sample against a claimed identity may require a successful match in every process, or may require a predetermined proportion of successful matches.
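The two combination rules just mentioned (a match in every process, or a predetermined proportion of matches) can be sketched as a single function. This is an illustrative sketch, not the claimed implementation; `combine_decisions` and `required_fraction` are hypothetical names.

```python
from typing import Sequence


def combine_decisions(matches: Sequence[bool], required_fraction: float = 1.0) -> bool:
    """Accept when at least `required_fraction` of the parallel processes matched.

    required_fraction = 1.0 demands a successful match in every process.
    """
    if not matches:
        raise ValueError("at least one process result is required")
    if not 0.0 < required_fraction <= 1.0:
        raise ValueError("required_fraction must be in (0, 1]")
    return sum(matches) / len(matches) >= required_fraction


print(combine_decisions([True, True, True]))        # every process matched
print(combine_decisions([True, False, True], 0.6))  # two of three is enough at 60%
```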

When building registration models according to the invention, combining massive overlapping of the sample frames with vector quantization (or an equivalent) gives a particular advantage. The massive overlapping is applied when the models are constructed, although it can also be applied when an utterance is tested. The technique uses a massive frame overlap, typically 80-90%, to generate a large number of possible alignments. The frames generated by the alignments are transformed into representative coefficients using the LPCC transform, producing a matrix of coefficients that represents all of the alignments. This avoids the conventional problems of frame alignment. The matrix is typically of size no_of_frames by LPCC_order, for example 1000 x 24. This is repeated for every utterance used to build the model, typically ten times, giving ten 1000 x 24 matrices. Vector quantization is then used to reduce the data and produce the model for the speaker. This has the effect of averaging the frames so as to reduce the significance of the dynamic components of the sampled speech data. The resulting model takes no cognizance of the position of a frame within the test utterance and is therefore inherently temporally independent. This addresses the problem of temporal dependency.
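The two data-handling steps in this paragraph, heavily overlapped framing and reduction by vector quantization, can be sketched as follows. This is a simplified illustration under stated assumptions: the LPCC transform is omitted, so the rows passed to `kmeans` merely stand in for per-frame coefficient vectors; plain Lloyd's k-means with a deterministic start is used as the VQ step; and the names `frame_starts` and `kmeans` are hypothetical.

```python
import math
from typing import List, Sequence


def frame_starts(n_samples: int, frame_len: int, overlap: float) -> List[int]:
    """Start indices of frames of length frame_len that overlap by the given fraction."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    hop = max(1, round(frame_len * (1.0 - overlap)))  # 90% overlap -> hop of 10% of a frame
    return list(range(0, n_samples - frame_len + 1, hop))


def kmeans(rows: List[Sequence[float]], k: int, iterations: int = 20) -> List[List[float]]:
    """Plain Lloyd's k-means; the k returned centers are the reduced data set."""
    if not 1 <= k <= len(rows):
        raise ValueError("k must be between 1 and the number of rows")
    step = len(rows) / k
    centers = [list(rows[int(i * step)]) for i in range(k)]  # evenly spaced initial picks
    for _ in range(iterations):
        groups: List[List[Sequence[float]]] = [[] for _ in range(k)]
        for row in rows:  # assignment step: nearest center wins
            nearest = min(range(k), key=lambda j: math.dist(row, centers[j]))
            groups[nearest].append(row)
        for j, group in enumerate(groups):  # update step: center moves to the group mean
            if group:  # an empty group keeps its previous center
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return centers


starts = frame_starts(n_samples=1000, frame_len=200, overlap=0.9)
print(len(starts), "frames, hop", starts[1] - starts[0])
```

Because the centers are computed from all rows regardless of their order, the resulting model carries no information about where in the utterance a frame occurred.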

The combination of VQ and massive frame overlapping gives a mode of operation different from that of conventional systems based on HMM/DTW. In HMM/DTW, all frames are considered equally valid and are used to form a final score that is thresholded into a yes/no decision. In the present invention, every row (frame) of the test sample data is tested against every row of the registration model data for the claimed speaker and the associated impostor cohort. For each row of test sample data, a best match is found among the rows of a registration model, giving a test score for the sample against that model. The test sample is matched to the registration model that gives the best score. If the match is with the claimed identity, the test speaker is accepted; if the match is with an impostor, the test speaker is rejected.
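The row-against-row test described above can be sketched as follows. The specification does not give the scoring formula in this passage, so the sketch assumes one: the score against a model is the mean, over test rows, of the Euclidean distance to the nearest model row, and the lowest value is treated as the best score. The names `model_distance` and `best_match` and the toy data are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Matrix = List[Sequence[float]]


def model_distance(test_rows: Matrix, model_rows: Matrix) -> float:
    """Mean, over test rows, of the distance to the closest row of one registration model."""
    if not test_rows or not model_rows:
        raise ValueError("both matrices must contain at least one row")
    return sum(min(math.dist(t, m) for m in model_rows) for t in test_rows) / len(test_rows)


def best_match(test_rows: Matrix, cohort_models: Dict[str, Matrix]) -> str:
    """Name of the cohort model (claimed identity included) closest to the test sample."""
    distances = {name: model_distance(test_rows, rows) for name, rows in cohort_models.items()}
    return min(distances, key=distances.get)


# Toy data: two-dimensional rows stand in for spectral feature vectors.
cohort = {"alice": [[0.0, 0.0], [1.0, 1.0]], "bob": [[5.0, 5.0], [6.0, 6.0]]}
test_sample = [[0.1, 0.0], [0.9, 1.0]]
claimed = "alice"
print("accepted" if best_match(test_sample, cohort) == claimed else "rejected")
```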

The present system thus uses LPCC and VQ modeling (or similar or equivalent spectral analysis and clustering techniques), together with massive overlapping of the sample frames, to produce a reference model for each registered speaker, which is stored in a database. In use, an input test utterance is subjected to similar spectral analysis to obtain an input test model that can be tested against the registered speaker data-set. A benefit of this approach is that it can be applied so as to obtain a very low false rejection rate (FRR), substantially equal to zero. The significance of this is discussed further below.

Parallel Modeling

As mentioned above, the performance of a speaker recognition system according to the invention can be improved by using multiple parallel processes to generate the models.

Referring to FIG. 7, a preferred embodiment of a speaker recognition system employing parallel modeling processes in accordance with one aspect of the invention comprises an input channel 100 for inputting a signal representing a speech sample to the system, a channel normalization process 200, a plurality of parallel signal processing channels 102a, 102b ... 102n, a classification module 110 and an output channel 112. The system further includes a registered speaker data-set 114, i.e. a database of speech models obtained from the speakers registered to use the system. The speech sample data is processed in parallel by each of the processing channels 102a-n, and the output of each processing channel is input to the classification module 110, which communicates with the database 114 of registered speaker data. A decision as to the identity of the source of the test utterance is output via the output channel 112.

Each processing channel 102a-n comprises, in sequence, a spectral shaping filter 24a-n, an (optional) added noise input 206a-n, a spectral analysis module 26a-n and a statistical analysis module 28a-n. The output from each statistical analysis module 28a-n is input to the classification module 110.

The spectral shaping filters 24a-n comprise a bank of filters which together divide the utterance signal into a plurality of overlapping frequency bands, each of which is then processed in parallel by the subsequent modules 26a-n and 28a-n. The number of processing channels, and hence the number of frequency bands, may vary; more channels provide more detail in the subsequent analysis of the input data. Preferably at least two channels are employed, and more preferably at least four. The filters 24a-n constitute a low-pass, band-pass or high-pass filter bank. The bandwidth of the base filter 24a is selected such that the false rejection rate (FRR) resulting from subsequent analysis of the output of the first channel 102a is zero or as close as possible to zero. The subsequent filters 24b-n have incrementally increasing bandwidths that progressively pass more of the signal from the input channel 100. The FRR for the output of each channel 102a-n is thus maintained close to zero, while the different channel outputs have slightly different false acceptance (FA) characteristics. Analysis of the combined outputs of the channels 102a-n yields a reduced overall false acceptance rate (a claimed identity is accepted only if the outputs of all of the channels are accepted) with an FRR close to zero. The significance of this is explained further below.

The use of multiple frequency bands improves on conventional single-channel spectral analysis by increasing the size of the feature vectors of interest in the subsequent statistical analysis.

Different types of parallel processing may be used in the modeling process to provide multiple feature sets that model different aspects of the input speech sample and/or alternative models of similar aspects. Banks of filters of other types may be used in addition to or instead of low-pass filters. Different types or variants of spectral and/or statistical analysis techniques may be used in the parallel processing channels. The parallel statistical analyses may apply different weighting values to sets of feature coefficients so as to obtain a set of slightly deviated models.

The architecture shown in FIG. 7 may be used both to obtain the registration models stored in the database 114 and to process test speech samples for testing against those registration models. Each registration model may include a data-set for each of a plurality of registration utterances. For each registration utterance there will be a matrix of data representing the output of each of the parallel modeling processes. Each of these matrices represents the clustered/averaged spectral feature vectors. Test sample data is subjected to the same parallel spectral analysis processes, but without clustering/averaging, so the test model data comprises a matrix of spectral analysis data for each of the parallel modeling processes. When a test model is tested against a registration model, the test matrix representing a particular modeling process is tested against the registration matrix generated by the same modeling process.

Classification

The nature of the reference models obtained by the modeling techniques described above is such that they lend themselves to direct model-to-model comparison. This enables the system to employ an identifier strategy in which each registration model is associated with an impostor cohort. That is, for the reference model of each registered speaker (the subject), there is an impostor cohort comprising a predetermined number of reference models of other registered speakers. These other registered speakers are distinct from the subject and have a known and predictable relationship to the subject's reference model. These predictable relationships enable the performance of the system to be improved. FIG. 11 illustrates, in a manner similar to FIG. 3, the results obtained by a conventional speaker recognition system, comparing the scores for an input utterance tested against the reference data for eight speakers. Speaker 1 is the true speaker, but the scores for some of the other speakers are close enough to reduce significantly the confidence with which the system can identify the correct speaker. FIG. 11b shows the equivalent results obtained using a system according to the present invention. The results for speaker 1 are much more clearly distinguished from those of the other speakers 2 to 8.

The speaker modeling method employed in embodiments of the invention is inherently simpler than conventional techniques such as HMM and possible alternatives such as Gaussian mixture models. However, the applicant believes that conventional "tight" statistical methods are inherently flawed and result in poor real-world performance, and that the relatively simple statistical methods of the present invention are much more effective in practice. As noted previously, the temporal nature of HMMs makes them susceptible to mimics, a problem which the present invention avoids. Furthermore, the models of the present invention are ideally suited to analyzing the structure of the registered speaker data-set by model-to-model testing.

The ability to perform model-to-model comparisons using the present speaker models provides two particular advantages. First, it provides the ability to identify the most relevant impostors in the registered speaker data-set (i.e. those that are close to, and uniformly distributed around, a particular model) and yields an effective and predictable speaker normalization mechanism. VQ modeling involves choosing the size of the model, i.e. the number of coefficients ("centers"). Once the size has been chosen, the positions of the centers are moved until they give the best fit to all of the registration data vectors. This effectively allocates a center to a cluster of registration vectors, so each center in the model represents a cluster of information that is important to the identity of the speaker.

Model-to-model testing makes it possible to predict how a registered speaker, or claimed identity, will perform against the database, both in the broad sense and in the area local to the claimed identity (i.e. in the system data space). FIG. 12 illustrates the results of testing the reference models for speakers 2 to 8 against the reference model for speaker 1. The ellipses show the model-to-model results, whereas the stars show actual scores for speaker utterances tested against model 1. Model-to-model tests can thus be used to predict the actual performance of a particular speaker against a particular reference model. The model-to-model results tend to lie at the bottom of the actual score distributions and therefore indicate how well a particular impostor will perform against model 1. This basic approach of using model-to-model tests to predict actual performance is already known. As described further below, this approach may be extended in accordance with one aspect of the invention to guard particular models against impostors using individually selected, statistically variable groupings.

The second advantage derived from model-to-model testing is the ability to predict the performance of a test utterance against some or, if need be, all of the registered speaker models. This enables a virtually unlimited number of test patterns to be used to confirm an identity, which is not possible with conventional systems.

In addition, model-to-model test results may be used to assemble a specific impostor cohort for use with each reference model. This allows accurate score normalization and also allows each model to be effectively guarded against impostors by a statistically variable grouping selected for each registered speaker. This is illustrated in FIG. 13. Each reference model can be regarded as a point in a multidimensional data space, so that distances between models can be calculated. For clarity, FIG. 13 shows this in two dimensions: each star represents a model, and the two-dimensional distance represents the distance between models.

Because the distribution of the speaker models is not uniform, a normalization technique based on a world-model will not operate equally well for all speaker models. Some speaker models are also relatively close to one another, which implies that there is potential for impostors to mimic registered speakers successfully. These issues can be resolved by creating, for each speaker model, a specific cohort of impostors around the subject model. This simplifies normalization and creates a guard against impostors. This is illustrated in FIG. 14 which shows, in a manner similar to FIG. 13, a subject model represented by a circle, the members of the impostor cohort represented by stars, and the score of an impostor claiming to be the subject represented by an "x". The impostor score is close enough to the subject model to cause recognition problems. However, because the speaker data-set makes it possible to predict how the true subject speaker will perform against the models of the impostor cohort, this information can be used to distinguish the impostor "x" from the true subject, by testing the impostor against the models of the cohort members as well as against the true subject model. That is, the impostor utterance "x" is closer to some of the cohort members than would be expected for the true subject, and further from others than expected. This indicates an impostor event and results in the impostor utterance being rejected as a match for the true subject.
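One way to build a cohort "around" a subject model is to pick the models nearest to it. The following is a sketch under stated assumptions, not the selection algorithm of the specification: each reference model is collapsed to a single point (as in the two-dimensional illustration), Euclidean distance is used, and the cohort is taken to include the subject itself so that its size equals the number of models tested. The name `select_cohort` and the toy coordinates are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def select_cohort(subject: str, model_points: Dict[str, Sequence[float]],
                  cohort_size: int) -> List[str]:
    """Return the subject followed by its (cohort_size - 1) nearest neighbouring models."""
    if subject not in model_points:
        raise KeyError(f"unknown subject {subject!r}")
    others = [name for name in model_points if name != subject]
    if not 1 <= cohort_size <= len(others) + 1:
        raise ValueError("cohort_size must be between 1 and the number of models")
    # Sort the other registered models by distance from the subject model.
    others.sort(key=lambda name: math.dist(model_points[subject], model_points[name]))
    return [subject] + others[:cohort_size - 1]


points = {"s1": [0.0, 0.0], "s2": [1.0, 0.0], "s3": [5.0, 5.0], "s4": [0.0, 2.0]}
print(select_cohort("s1", points, 3))
```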

This provides the basis for a two-stage recognition process which first rejects impostors who are clearly not the claimed speaker, followed where necessary by a more detailed process applied to utterances that are close enough to possibly be the claimed speaker.

In applications of speaker identification systems, it is very important to minimize the possibility of false rejections, i.e. instances in which the identity claimed by a user is incorrectly rejected as being false. In accordance with one aspect of the invention, an identifier strategy is employed which provides very low false rejection rates, provides predictable system performance, and minimizes the problems associated with the use of thresholds in accepting or rejecting a claimed identity.

In accordance with this strategy, the database of registered speakers (the speaker space) is partitioned, so that each speaker registered with the system is assigned to a cohort comprising a fixed number N of registered speakers, as described above. The speaker classification module of the system (e.g. the module 110 in the system of FIG. 4) operates such that the input test utterance is compared with all of the members of the cohort associated with the identity claimed by the speaker, and the test utterance is classified as belonging to the cohort member that gives the best match. That is, the test utterance is always matched to one member of the cohort, whether the claimed identity or another member. If the cohort member to which the utterance is matched corresponds to the claimed identity, the claimed identity is accepted as true. If the utterance is matched to any other member of the cohort, the claimed identity is rejected as false.

The modeling and classification processes can be tuned so that the proportion of false rejections is effectively zero (FR = 0%), i.e. there is substantially zero probability that a speaker will be wrongly identified as a member of the cohort other than the claimed identity. This is facilitated by the use of model-to-model comparisons, so that a match is based not simply on the test utterance being matched to the single closest model but also on its relationship to the other members of the cohort. If the cohort is of a fixed size N, the maximum possible proportion of false acceptances is FA = 100/N % and the total average error rate is (FA + FR)/2 = 50/N %. If the cohort size N is 20, the error rate is 2.5%, i.e. an accuracy of 97.5%. If the cohort size is fixed, the system is scalable to a population of any size while maintaining a fixed and predictable error rate. That is, the accuracy of the system is based on the size of the cohort and is independent of the size of the general population, making the system applicable to very large populations. Accuracy can be improved by increasing the cohort size, as long as the false rejection rate does not increase significantly.

This strategy does not rely on thresholds to determine the result, but thresholds can still be used to reduce false acceptances. That is, once a test utterance has been matched to the claimed identity by the method described above, a threshold can be applied to decide whether the match is close enough to be finally accepted.
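A hedged sketch of this optional second stage is shown below. The names `verify_with_threshold` and `max_distance` are hypothetical, and the scores are again assumed to be distances for which smaller is better; the source does not specify how the threshold is chosen.

```python
def verify_with_threshold(cohort_scores, claimed_id, max_distance):
    """Accept only if the claimed identity is the best match AND close enough."""
    best_id = min(cohort_scores, key=cohort_scores.get)
    if best_id != claimed_id:
        return False                      # matched another cohort member
    # Optional extra check intended to reduce false acceptances.
    return cohort_scores[claimed_id] <= max_distance


scores = {"alice": 0.42, "bob": 0.77, "carol": 0.58}
close_enough = verify_with_threshold(scores, "alice", max_distance=0.50)
too_far = verify_with_threshold(scores, "alice", max_distance=0.30)
```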

As described above, a predetermined algorithm may be used to select the impostor cohort associated with a particular registration model, so that the members of the impostor cohort have a particular relationship to the registration model in question. In principle, this can provide a degree of optimization in the classification process. However, a randomly selected impostor cohort has been found to perform equally well for most practical purposes. The most important point is that the cohort size should be predetermined so as to give predictable performance. The impostor cohort for a particular registration model may be selected at registration or when a test utterance is tested.

Parallel Classification

The performance of a speaker recognition system according to the invention can be improved by using multiple parallel classification processes. In general, such processes will be statistically independent or partially independent. This approach yields multiple classification results, which can be combined to derive a final result, as illustrated in FIG. 5.

For example, using the identification strategy described above, the same test utterance can be tested against a number of different cohorts, against different registration phrases, or against combinations of these. Where multiple cohorts are used, each cohort gives a result with a false rejection rate of essentially zero (FR = 0%) and a false acceptance rate of FA = 100/N %, as before. For n cohorts of equal size N, the overall false acceptance rate is FA = 100*M/Nⁿ % and the average error rate is 50*M/Nⁿ %, where M is a coefficient greater than 1 that represents the effect of the processes not being entirely statistically independent. Thus, with two cohorts and a cohort size of 20, the average error rate is 0.125*M %, compared with 2.5% for the single cohort described above. Thresholds can also be applied to improve accuracy further, as previously described.
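The multi-cohort figures can be reproduced with the sketch below. The function and parameter names are illustrative; `dependence_factor` stands for the coefficient M, whose value the source does not give, so M = 1 (full independence) is used purely as an example.

```python
def parallel_cohort_error_rates(cohort_size, num_cohorts, dependence_factor=1.0):
    """Return (overall FA %, average error %) for n cohorts of equal size N.

    Uses FA = 100 * M / N**n % with FR assumed to be 0 %.
    """
    false_acceptance_pct = 100.0 * dependence_factor / cohort_size ** num_cohorts
    average_error_pct = false_acceptance_pct / 2.0
    return false_acceptance_pct, average_error_pct


fa_two, avg_two = parallel_cohort_error_rates(cohort_size=20, num_cohorts=2)
```

With N = 20, n = 2 and M = 1 the average error rate is 0.125%, i.e. the 0.125*M % quoted in the text.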

Other types of partially statistically independent processes may be used in the modeling process, in the classification process, or in both. Besides the examples given above, a single utterance may be divided into a number of parts that are processed separately.

Normalization

A further problem with conventional speaker recognition systems is that system performance is affected by differences between the speech sampling systems used for initial registration and for subsequent recognition. Such differences arise from different transducers (microphones), sound cards, and so on. According to another embodiment of the invention, this problem can be eliminated or mitigated by normalizing the speech samples on the basis of a normalization characteristic obtained and stored for each sampling system (or, possibly, each type of sampling system) used to input speech samples to the recognition system. Alternatively (and preferably), the normalization characteristic can be estimated "on the fly" when a speech sample is input to the system. The normalization characteristic can then be applied to all input speech samples, so that reference models and test scores are independent of the characteristics of the particular sampling system. Alternatively or additionally, according to another embodiment of the invention, a normalization process can be applied at the time of testing a test sample against the registration sample data.

The normalization characteristic is effectively derived from the transfer function of the sampling system. It can be obtained, for example, by inputting a known reference signal to the sampling system and processing the sampled reference signal through the speech recognition system. The resulting output of the recognition system can be stored and later used to normalize speech samples subsequently input through the same sampling system or the same type of sampling system.

Alternatively, as shown in FIG. 15, a speech signal S(f) that has been modified by the transfer function C(f) of the input channel 300 can be normalized on the fly by inputting the modified speech signal S(f)*C(f) both to an estimating module 302, which estimates the transfer function C(f) of the channel 300, and to a normalization module 304, and by applying the inverse of the estimated transfer function, 1/C(f), in the normalization module, so that the output of the normalization module closely approximates the input signal S(f). The estimating module 302 creates a digital filter with the spectral characteristics of the channel 300, and the inverse of this filter is used to normalize the signal. For example, the inverse filter can be calculated by determining, for each frame, the all-pole filter that represents the spectral quality of the sampled signal. The filter coefficients are then smoothed over the frames so that, as far as possible, components other than the channel spectrum C(f) are removed. The resulting estimate of the channel spectrum is used to generate the inverse filter 1/C(f). This basic approach can be refined by smoothing the pole positions of the filters obtained for the frames, with intelligent cancellation of those poles that are known not to be related to the channel characteristics.
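The estimate-then-invert idea can be illustrated with a small pure-Python sketch. Everything here is an assumption made for illustration: the all-pole fit uses the autocorrelation method with the Levinson-Durbin recursion, "smoothing over frames" is taken to be a plain average of the per-frame coefficients, the "speech" is white noise, and the channel is a single pole at 0.9. The source does not prescribe any of these choices.

```python
import random


def autocorrelation(frame, max_lag):
    """Biased autocorrelation r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return a[0..order] (a[0] = 1) such that 1/A(z) is the all-pole fit."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def estimate_inverse_filter(signal, frame_len, hop, order):
    """Average the per-frame all-pole coefficients (simple smoothing)."""
    fits = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        r = autocorrelation(signal[start:start + frame_len], order)
        if r[0] > 0.0:                      # skip all-zero frames
            fits.append(levinson_durbin(r, order))
    return [sum(f[k] for f in fits) / len(fits) for k in range(order + 1)]


def apply_inverse_filter(signal, a):
    """FIR filtering with A(z), which acts as the inverse filter 1/C(f)."""
    return [sum(a[k] * signal[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(signal))]


# Toy data: white noise passed through a one-pole channel 1 / (1 - 0.9 z^-1).
random.seed(0)
source = [random.gauss(0.0, 1.0) for _ in range(4000)]
received, state = [], 0.0
for x in source:
    state = x + 0.9 * state
    received.append(state)

a_hat = estimate_inverse_filter(received, frame_len=400, hop=80, order=1)
restored = apply_inverse_filter(received, a_hat)
```

In this toy case `a_hat` should come out close to `[1.0, -0.9]`, so filtering the received signal with it approximately undoes the channel. Real speech is not white, which is why the text relies on smoothing over many frames and on cancelling poles unrelated to the channel.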

Depending on the nature of the transfer function/normalization characteristic, the normalization process may be applied to the speech sample before it is processed by the speaker recognition system, to the spectral data, or to the model generated by the system.

A preferred channel normalization method according to one embodiment of the invention is applied to the test model data and the relevant registration models at the time of testing the test sample against the registration models.

The overall effect of the channel characteristics on a speech signal can be expressed as follows: [equation not reproduced in this text]

Here the left-hand side is the estimate of the speaker characteristics, cc(ω) is the channel characteristic (or the channel characteristic changed as appropriate), and the speech signal is treated as comprising a static part and a dynamic part, as described previously. Ideally, the unwanted channel characteristic can be estimated and removed. In practice, the removal can be carried out in the time domain, in the frequency domain, or in a combination of the two. Both achieve the same effect: cc(ω) is estimated and removed using some form of inverse filter or spectral division. If an estimate of the spectrum of the unwanted channel is available, the result can be expressed as follows: [equation not reproduced in this text]

If the estimate of the channel characteristic is good (that is, close to the true channel characteristic), the unwanted spectrum cancels and the estimate of the speech signal is correspondingly good. This would normally be implemented using an FFT-based algorithm.
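The equations referred to in this passage did not survive in the present text. One reconstruction that is consistent with the surrounding definitions and with the later relation v(ω) = ss(ω)sd(ω)h(ω) is sketched below. The hatted symbols are assumptions introduced here, with ss and sd denoting the static and dynamic parts of the speech signal; this should not be read as the original notation.

```latex
% Speech estimate as seen through the channel (assumed notation)
\hat{s}(\omega) = ss(\omega)\, sd(\omega)\, cc(\omega)

% Spectral division by the channel estimate \widehat{cc}(\omega)
\frac{\hat{s}(\omega)}{\widehat{cc}(\omega)}
  = \frac{ss(\omega)\, sd(\omega)\, cc(\omega)}{\widehat{cc}(\omega)}
  \approx ss(\omega)\, sd(\omega)
  \quad \text{when } \widehat{cc}(\omega) \approx cc(\omega)
```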

An alternative implementation models the channel characteristic as a filter, most suitably in all-pole form, which can be expressed as follows: [equation not reproduced in this text]

This is the most basic form of an ARMA model and can be extracted directly from the time signal, possibly using linear prediction.

Similar normalization can be carried out on the basis of the cepstral representation.

In the cepstral domain, the speech signal can be expressed as follows: [equation not reproduced in this text]

A speech signal modified by the unwanted channel characteristic can then be expressed as follows: [equation not reproduced in this text]

As can be seen, the process in this case is additive rather than a product. It should be noted, however, that both cs and cc are static, and that cc must be removed on its own, without removing cs.
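The two cepstral-domain expressions are likewise missing from this text. A reconstruction consistent with the additive form used later in Equation B, v₁(τ) = cs₁(τ) + cd₁(τ) + h_a(τ), is given below. The symbols c(τ) and ĉ(τ) are assumptions introduced for illustration, while cs, cd and cc are the static speech part, the dynamic speech part and the channel term named in the text.

```latex
% Clean speech in the cepstral domain (assumed notation)
c(\tau) = cs(\tau) + cd(\tau)

% Speech modified by the unwanted channel: the channel term is additive
\hat{c}(\tau) = cs(\tau) + cd(\tau) + cc(\tau)
```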

It is important to consider the different conditions (registration model, database-derived cohort, test speaker, and so on) and the context in which the signal cc is to be removed.

FIG. 16 shows the various sources of corruption of a speech sample in a speaker recognition system. The input speech signal s(t) is altered by ambient background noise b(t), recording device bandwidth r(t), electrical noise and channel crosstalk t(t), and transmission channel bandwidth c(t), so that the signal input to the recognition system is an altered signal v(t). The system is easier to analyze in the frequency domain, where the signal at the verifier is:

v(ω) = ((s(ω) + b(ω))r(ω) + t(ω))c(ω)    (Equation A)

Two conditions can be defined at the verifier: when the person is speaking and when the person is not speaking. The two corresponding equations are v(ω) = ((s(ω) + b(ω))r(ω) + t(ω))c(ω) and v(ω) = ((0 + b(ω))r(ω) + t(ω))c(ω).

First, consider a simplified problem, as applied to the system according to the invention, and assume that b(t) = t(t) = 0. Then v(ω) = s(ω)r(ω)c(ω) = s(ω)h(ω), where h(ω) = r(ω)c(ω) is the combined channel spectral characteristic, and v(ω) = s(ω)h(ω) = ss(ω)sd(ω)h(ω).

The cohort models are selected from a speaker database recorded using the same channel (b), whereas the true speaker model is recorded using a different channel (a). The test speaker may be either the true speaker or an impostor and may be recorded using a third channel (c). FIG. 17 shows this diagrammatically. FIG. 18 shows the same thing expressed in an alternative form using cepstral coefficients. The signal component values shown in FIGS. 17 and 18 are averages corresponding to the sum of the sample frame data.

Consider the claimed identity model, built from v₁(τ) = cs₁(τ) + cd₁(τ) + h_a(τ) (Equation B), and the cohort models, built from v_m(τ) = cs_m(τ) + cd_m(τ) + h_b(τ) (Equation C).

The problem for the verifier is that two different channels are used in the identifier. Assuming that the difference between them is hd(τ) = h_a(τ) - h_b(τ), i.e. h_a(τ) = h_b(τ) + hd(τ), the claimed identity model referred to the cohort channel (b) is v₁(τ) = cs₁(τ) + cd₁(τ) + h_a(τ) = cs₁(τ) + cd₁(τ) + h_b(τ) + hd(τ), that is, v₁(τ) = (cs₁(τ) + hd(τ)) + cd₁(τ) + h_b(τ).

It can be seen that the mean of the static part of the claimed identity model is shifted by the difference between the channels, which can cause errors if the true speaker is tested using channel b without the situation being corrected. Problems with false acceptances may likewise arise when channel a is used.

One way to address this problem is to remove the mean from the claimed identity model, but simple mean removal would also remove the static part of the speaker model, giving v₁(τ) = cd₁(τ). However, looking at Equation A (the system model including additive noise), v(ω) = ((s(ω) + b(ω))r(ω) + t(ω))c(ω), and considering the case where the speaker pauses, s(ω) = 0, so that v(ω) = (b(ω)r(ω) + t(ω))c(ω), i.e. v(ω) = n(ω)c(ω), where n(ω) is a noise signal.

In cepstral form, v(τ) = n(τ) + c(τ) = sn(τ) + dn(τ) + c(τ), where, as before, sn represents the static part of the noise and dn represents the result of summing the dynamic part.

The mean of a model built from this is sn(τ) + c(τ), where sn is any steady-state noise, such as an interfering tone, and c is the channel.

Considering again Equation B (the condition under which the claimed identity model is built), v₁(τ) = cs₁(τ) + cd₁(τ) + h_a(τ), which is the noise-free case. With steady-state noise added, v₁(τ) = cs₁(τ) + cd₁(τ) + h_a(τ) + sn(τ).

Constructing a speaker-pause (silence) model for this case gives sn(τ) + h_a(τ); using this to remove the mean yields v₁(τ) = cs₁(τ) + cd₁(τ).
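The subtraction described above can be sketched on toy cepstral data. The function names and all numeric values are assumptions made for the example: each row stands for the cepstral coefficients of one frame, the "speech" rows carry cs + cd plus a constant bias sn + h_a, and the "pause" rows carry the bias only.

```python
def column_means(rows):
    """Mean of each cepstral coefficient (column) over all rows."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def remove_silence_mean(speech_rows, silence_rows):
    """Subtract the silence-model mean (sn + h_a) from every speech frame."""
    bias = column_means(silence_rows)
    return [[value - b for value, b in zip(row, bias)] for row in speech_rows]


clean = [[1.0, 2.0], [3.0, 4.0]]            # cs + cd for two frames (toy values)
bias = [0.5, -0.25]                         # sn + h_a, constant over frames
observed = [[c + b for c, b in zip(row, bias)] for row in clean]
pauses = [bias[:], bias[:], bias[:]]        # frames captured while the speaker pauses
recovered = remove_silence_mean(observed, pauses)
```

In this idealized case `recovered` equals `clean`, so the static speaker term cs survives, whereas subtracting the mean of the speech frames themselves would remove it.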

This gives a model that is unbiased by the channel. A similar process is applied to each model, so that each has its channel bias removed by its own silence model. The test speaker is treated in the same way, with its silence model used to remove the channel effects.

As described above, removing (reducing) the channel characteristics using the silence model requires adequate channel noise and perfect detection of the silence portions of the utterance. These cannot be guaranteed, so their effect needs to be mitigated (for example, if the "silence" contains some speech, part of the static speech of the claimed identity speaker is included in the silence model and removed). Fortunately, this can be dealt with by one simple modification of the process: all of the cohort models are referred to the same silence model.

That is, the silence mean of the claimed identity model is re-added to every model in the cohort (including the claimed identity model). This refers all models to the same mean, sn(τ) + h_a(τ). The same normalization is applied to the test model, so that all models and the test utterance are referred to the same reference point. In effect, one reference channel and noise condition is chosen and everything else is referred to it.

This is illustrated diagrammatically in FIG. 19, which shows the cepstral coefficients of the test utterance, together with those of the claimed identity model and of cohort models 1 to m, being input to the classifier 110. A "silence model" or "normalization model" 400 derived from the claimed identity registration data is used to normalize each of these before it is input to the classifier, so that the actual inputs to the classifier are a normalized test utterance, a normalized claimed identity model, and normalized cohort models. Ideally, the normalization model 400 is based on data from the periods of silence in the claimed identity registration sample, as described above. However, it may instead be derived from the complete claimed identity registration sample. In practice, the normalization model comprises a single row of cepstral coefficients, each of which is the mean of one column (or of selected members of one column) of cepstral coefficients from the claimed identity model. These means are used to replace the means of each of the input data sets. Taking the test utterance as an example, the mean of each column of the test utterance cepstral coefficients is subtracted from each individual member of that column, and the corresponding mean from the normalization model is added to each individual member of the column. A similar operation is applied to the claimed identity model and to each of the cohort models.
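The mean-replacement step can be sketched as follows. The function names and the toy matrices are assumptions made for illustration; each row is one vector of cepstral coefficients and each column is one coefficient index, and here the normalization model is simply taken as the column means of the complete claimed identity model.

```python
def column_means(rows):
    """Mean of each cepstral coefficient (column) over all rows."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def normalize_to_model(rows, normalization_model):
    """Subtract each column's own mean, then add the normalization-model mean."""
    own_means = column_means(rows)
    return [[value - own + ref
             for value, own, ref in zip(row, own_means, normalization_model)]
            for row in rows]


claimed_model = [[1.0, 4.0], [3.0, 8.0]]        # toy values
cohort_model = [[0.0, 1.0], [2.0, 3.0]]
test_utterance = [[10.0, 20.0], [14.0, 22.0]]

# A single row of means derived from the claimed identity model.
normalization_model = column_means(claimed_model)

normalized_test = normalize_to_model(test_utterance, normalization_model)
normalized_claimed = normalize_to_model(claimed_model, normalization_model)
normalized_cohort = normalize_to_model(cohort_model, normalization_model)
```

After this step every input has the same column means as the normalization model, while the spread of each data set around its mean is unchanged, which is what refers the test utterance and all models to one reference point.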

The normalization model may be derived from any of the claimed identity model, the test utterance, or the cohort models. It is preferably derived from the claimed identity model or the test utterance, and most preferably from the claimed identity model. The normalization model may be derived either from the "raw" registration sample cepstral coefficients or from the final model after vector quantization. That is, it may be derived at registration and stored together with the registration model, or it may be calculated on demand as part of the verification process. In general, it is preferred that the normalization model is calculated for each registered speaker at registration and stored as part of the registered speaker database.

These normalization techniques can be used with various types of speaker recognition system, but are advantageously combined with the speaker recognition system of the present invention.

The speaker recognition system according to the invention provides improved performance for several reasons. First, the modeling technique used significantly improves the separation between true speakers and impostors. This improved modeling makes the system less sensitive to real-world problems such as changes in the sound system (speech sampling system) and changes in speaker characteristics (for example, due to a cold). Second, the modeling technique is inherently non-temporal, so it is less sensitive to temporal variations in speech and therefore provides long-term persistence of the speaker models. Third, the use of filter pre-processing allows the models to be used under variable bandwidth conditions. For example, models created using a high-fidelity sampling system such as a multimedia PC will work with input received through a reduced-bandwidth input channel such as a telephony system.

The preferred methods according to the invention are suitable for use in text-independent as well as text-dependent speaker recognition systems.

System

The invention thus provides the basis for flexible, reliable, and simple speech recognition systems that operate on a local or wide area basis and use a variety of communication/input channels. FIG. 16 shows an example of a wide area system operating over a local network and over the Internet to authenticate users of a database system server 400, which is connected to a local network 402, such as an Ethernet network, and, via a router 404, to the Internet 406. A speaker authentication system server 408 implementing a speaker recognition system according to the invention is connected to the local network for the purpose of authenticating users of the database 400. Users of the system may be connected directly to the local network 402. More generally, users at sites such as 410 and 412 may access the system via desktop or laptop computers 414 and 416 that are equipped with microphones and connected to other local networks, which are in turn connected to the Internet 406. Other users, such as 418, 420, and 422, may access the system by dial-up modem connection through the public switched telephone network 424 and an Internet service provider 426.

Implementation

The algorithms used by the speaker recognition system according to the invention may be implemented as computer programs using any suitable programming language, such as C or C++. The executable programs may take any required form, including stand-alone applications on any hardware/operating system platform, embedded code in DSP chips (hardware/firmware implementations), or incorporation into an operating system (for example, as MS Windows DLLs). Similarly, the user interface (for both system registration and subsequent system access) may be implemented in various forms, including web-based client-server systems and web-browser-based interfaces, in which case the speech sampling may be implemented using, for example, ActiveX/Java components.

Besides desktop and laptop computers, the system is applicable to other terminal devices, including palmtop devices and WAP-enabled mobile phones, communicating over wired and/or wireless data/telecommunications networks.

Applications

Speaker recognition systems with the flexibility and reliability provided by the invention have numerous applications. One particular example, in accordance with a further aspect of the invention, is the provision of an audit trail of users who access and/or modify digital information such as documents or database records. Such transactions can be logged with date/time and user identity information, as is well known in the art. However, conventional systems do not normally verify or authenticate the identity of the user.

Preferably, speaker recognition using a speaker recognition system according to the invention is used to verify the identity of the user whenever required, for example when opening and/or editing and/or saving a digital document, database record, or the like. The document or record itself may be marked with data relating to the speaker verification procedure, or such data may be recorded in a separate audit trail, providing a verified record of access to and modification of the protected documents, records, and so on. Unauthorized users identified by the system are denied access to the system and are prevented from performing actions that the system monitors.

Improvements and modifications may be incorporated without departing from the scope of the invention as defined in the appended claims.

Claims

1. A method of processing a speech sample to obtain a model of the speech sample for use in a speaker recognition system, the method comprising:

Dividing the speech sample into a plurality of frames;

Obtaining, for each frame, a set of feature vectors representing the smoothed frequency spectrum of the frame; and

Applying a clustering algorithm to the feature vectors of the frames to obtain a reduced data set representing the original speech sample,

Wherein adjacent frames overlap by at least 80%.

2. The method of claim 1, wherein the adjacent frames overlap by 95% or less.

3. The method of claim 1, wherein the adjacent frames overlap by an amount ranging from 80% to 90%.

4. The method of any one of claims 1 to 3, wherein the clustering algorithm comprises a k-means algorithm or a vector quantization algorithm.

5. The method of claim 1, wherein the set of feature vectors representing the smoothed frequency spectrum of the frame is obtained using linear predictive coding/cepstral (LPCC) analysis, a Fast Fourier Transform (FFT), or a filter bank.

6. The method of any one of claims 1 to 5, further comprising storing the model of the speech sample, together with the identity of the speaker, in a database of registration models of speakers registered in the speaker recognition system.

7. The method of claim 6, wherein each registration model comprises a plurality of speech sample models representing a plurality of different utterances.

8. The method of claim 6 or 7, wherein each registration model comprises a plurality of speech sample models representing the same utterance modeled using a plurality of parallel modeling processes.

9. The method of any one of claims 6 to 8, further comprising associating the model of the speech sample with a cohort comprising a predetermined number of other speakers registered with the speaker recognition system.

10. The method of claim 6, further comprising processing a second speech sample to obtain a test model of the second speech sample for testing against the database of registration models, wherein processing the second speech sample comprises:

dividing the second speech sample into a plurality of frames; and

for each frame, obtaining a set of feature vectors representing the smoothed frequency spectrum of the frame.

11. The method of claim 10, wherein the set of feature vectors representing the smoothed frequency spectrum of each frame of the second speech sample is obtained using linear predictive coding/cepstral (LPCC) analysis, a Fast Fourier Transform (FFT), or a filter bank.

12. The method of claim 10 or 11, wherein the test model includes a plurality of speech sample models representing the same pronunciation modeled using a plurality of parallel modeling processes.

13. The method of any one of claims 10 to 12, further comprising testing the test model against the registration models of a claimed identity and of the associated cohort as defined in claim 9, whereby the claimed identity is tested.

14. A speaker recognition method, wherein:

a plurality of speakers to be recognized by a speaker recognition system are registered by storing, in a database of registered speakers, a registration model for each speaker, the registration model representing at least one speech sample from that speaker;

each registered speaker is associated with a cohort of a predetermined number of other registered speakers;

a test speech sample from a speaker claiming to be one of the registered speakers is modeled and tested, using a classification process, against the registration model of the claimed speaker and the registration models of the associated cohort; and

the classification process always matches the test model with either the claimed speaker or a member of the associated cohort, such that the false acceptance rate of the system is determined by the size of the cohort.
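The closed-set decision described in claim 14 can be sketched as follows. This is an illustration under simplifying assumptions, not the patented classifier: each model is reduced to a single feature vector, Euclidean distance stands in for the classification process, and the names `classify`, `verify`, `enrolled`, and `cohorts` are hypothetical.

```python
import math


def classify(test_model, candidates):
    """Return the candidate identity whose model is nearest to the test model.

    There is no 'reject' outcome: one of the candidates is always returned.
    """
    return min(candidates, key=lambda name: math.dist(test_model, candidates[name]))


def verify(test_model, claimed_id, enrolled, cohorts):
    """Accept the claim only if the claimed identity beats every cohort member."""
    # Candidate set = claimed speaker plus that speaker's fixed cohort.
    candidates = {claimed_id: enrolled[claimed_id]}
    for member in cohorts[claimed_id]:
        candidates[member] = enrolled[member]
    return classify(test_model, candidates) == claimed_id


enrolled = {"alice": [0.0, 0.0], "bob": [5.0, 5.0], "carol": [9.0, 0.0]}
cohorts = {"alice": ["bob", "carol"]}
print(verify([0.5, 0.2], "alice", enrolled, cohorts))  # nearest model is alice's
print(verify([8.0, 1.0], "alice", enrolled, cohorts))  # nearest model is carol's
```

If an impostor's test model were equally likely to land nearest any of the N + 1 candidates (an illustrative assumption, not a statement from the claims), the chance of a false acceptance would be 1/(N + 1), which shows how the cohort size N can set the false acceptance rate.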

15. The method of claim 14, wherein the modeling processes used to model the test speech sample and the registered speakers' speech samples, and/or the classification processes used to test the test model against the registration models, are selected to provide a false rejection rate substantially equal to zero, such that the overall error rate of the system is determined substantially only by the false acceptance rate.

16. The method of claim 14 or 15, wherein the test model is tested using multiple parallel classification processes, and the test model is matched with a registration model only if a predetermined number of the parallel classification processes produce a match with that registration model, thereby reducing the false acceptance rate of the system for a given cohort size.
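The voting rule of claim 16 can be sketched in Python as below. The three distance measures are only stand-ins for "parallel classification processes" (an assumption made for brevity), single vectors stand in for models, and the names `nearest` and `verify_parallel` are hypothetical.

```python
import math


def euclidean(a, b):
    return math.dist(a, b)


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def nearest(test_model, candidates, distance):
    """One classification process: pick the closest candidate under `distance`."""
    return min(candidates, key=lambda name: distance(test_model, candidates[name]))


def verify_parallel(test_model, claimed_id, candidates, distances, required):
    """Accept only if at least `required` parallel processes match the claimed identity."""
    votes = sum(1 for d in distances
                if nearest(test_model, candidates, d) == claimed_id)
    return votes >= required


candidates = {"alice": [0.0, 0.0], "bob": [5.0, 2.4]}
processes = [euclidean, manhattan, chebyshev]
# Only the Chebyshev process places this test model nearest to alice.
print(verify_parallel([2.4, 2.4], "alice", candidates, processes, required=1))
print(verify_parallel([2.4, 2.4], "alice", candidates, processes, required=2))
```

Raising `required` makes acceptance stricter: a test model that matches the claimed identity under only one process is accepted with `required=1` but rejected with `required=2`.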

17. The method of claim 16, wherein the registration models and the test model are each obtained using multiple parallel modeling processes, and the parallel classification processes compare the results of the parallel modeling processes applied to the test speech sample with the corresponding results of the parallel modeling processes applied to the registration speech samples.

18. The method of any one of claims 8, 12 and 17, wherein the parallel modeling processes comprise at least one of: different frequency banding applied to the speech samples; different spectral modeling applied to the speech samples; and different clustering applied to the feature vectors representing the speech samples.

19. The method of claim 16, wherein the parallel classification processes include testing the test model against different cohorts of registered speakers.

20. The method of claim 16, wherein the parallel classification processes include testing the test model against different pronunciations represented by the registration models.

21. The method of any one of claims 14 to 20, wherein the registration models and the test model are obtained using the method of any one of claims 1 to 13.

22. A method of normalizing speech models in a speaker recognition system of the type in which:

speech samples are input to the system through different input channels having different channel characteristics; and

a test model representing a test sample is tested, using a classification process, against a set of registration models representing speech samples from speakers registered with the system,

the method comprising deriving a normalization model from one of the registration speech samples or from the test speech sample, and normalizing the test model and the registration models against which it is to be tested using the normalization model, before testing the normalized test model against the normalized registration models.

23. The method of claim 22, wherein the normalization model is derived from a registration speech sample for the identity claimed for the test speech sample.

24. The method of claim 23, wherein the normalization model is derived from the registration model for the identity claimed for the test speech sample.

25. The method of any one of claims 22 to 24, wherein each speech sample is divided into a plurality of frames; a set of feature vectors representing the smoothed frequency spectrum of each frame is obtained; and the normalization model is obtained by calculating the average values of the sets of feature vectors from at least some of the frames of the speech sample used to derive the normalization model.

26. The method of claim 25, wherein the frames used to derive the normalization model are frames corresponding to silence periods in the speech sample used to derive the normalization model.

27. The method of claim 25 or 26, wherein the test model and the registration models are normalized by replacing the average values of the feature vectors of the test model and of the registration models with the corresponding average values from the normalization model.
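One way to read the mean replacement of claims 25 to 27 is sketched below in Python. It assumes that "replacing the average values" means shifting each model so that its per-dimension mean equals the mean of the normalization model; the names `mean_vector` and `normalize` are hypothetical, and this is a sketch rather than the patented procedure.

```python
def mean_vector(vectors):
    """Per-dimension average of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def normalize(vectors, norm_mean):
    """Shift the vectors so that their mean becomes `norm_mean`."""
    own_mean = mean_vector(vectors)
    # Remove the model's own mean, then add the normalization model's mean.
    return [[x - m + t for x, m, t in zip(v, own_mean, norm_mean)]
            for v in vectors]


# Normalization model: averages taken from the frames chosen for it.
norm_mean = mean_vector([[0.0, 9.0], [0.0, 11.0]])  # [0.0, 10.0]
test_model = normalize([[1.0, 2.0], [3.0, 6.0]], norm_mean)
print(test_model)  # [[-1.0, 8.0], [1.0, 12.0]]
```

Applying the same `normalize` call to the test model and to every registration model gives all of them a common mean, so a constant offset introduced by a particular input channel no longer separates them.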

28. The method of any one of claims 22 to 27, wherein the speech samples are processed using the method of any one of claims 1 to 13.

29. The method of any one of claims 14 to 21, wherein the test model and the registration models are normalized prior to classification using the method of any one of claims 22 to 28.

30. The method of any preceding claim, wherein speech samples are input to the speaker recognition system via an input channel having a transfer function that modifies the speech sample data, the method further comprising:

estimating the transfer function of the input channel, and normalizing the modified speech sample data using the inverse of the estimated transfer function.

31. A speaker recognition system comprising data processing and storage means adapted to implement the method of any one of claims 1-30.

32. A computer program comprising symbolic code for instructing a computer to carry out the method of any one of claims 1 to 30.

33. A data carrier encoded with a computer program comprising symbolic code for instructing a computer to carry out the method of any one of claims 1 to 30.